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Preface 


This volume contains the proceedings of the 16th International Conference on 
Semantic Systems (SEMANTICS 2020). SEMANTICS offers a forum for the exchange 
of latest scientific results in semantic systems and complements these topics with new 
research challenges in areas like data science, machine learning, logic programming, 
content engineering, social computing, and the Semantic Web. The conference is in its 
16th year and has developed into an internationally visible and professional event at the 
intersection of academia and industry. Contributors to and participants of the confer- 
ence learn from top researchers and industry experts about emerging trends and topics 
in the wide area of semantic computing. The SEMANTiCS community is highly 
diverse; attendees have responsibilities in interlinking areas such as artificial intelli- 
gence, data science, knowledge discovery and management, big data analytics, 
e-commerce, enterprise search, technical documentation, document management, 
business intelligence, and enterprise vocabulary management. 

The conference’s subtitle in 2020 was “The Power of AI and Knowledge Graphs,” 
and especially welcomed submissions to the following topics: 


Web Semantics and Linked (Open) Data 

Enterprise Knowledge Graphs, Graph Data Management, and Deep Semantics 
Machine Learning and Deep Learning Techniques 

Semantic Information Management and Knowledge Integration 
Terminology, Thesaurus, and Ontology Management 

Data Mining and Knowledge Discovery 

Reasoning, Rules, and Policies 

Natural Language Processing 

Data Quality Management and Assurance 

Explainable Artificial Intelligence 

Semantics in Data Science 

Semantics in Blockchain and Distributed Ledger Technologies 
Trust, Data Privacy, and Security with Semantic Technologies 
Economics of Data, Data Services, and Data Ecosystems 


We additionally issued calls for three special tracks: 


e Digital Humanities and Cultural Heritage 
e Blockchain 
e LegalTech 


Due to the health crisis caused by the COVID-19 pandemic, SEMANTICS 2020 
took place in a highly reduced form. All on-site events were canceled and postponed to 
2021. To keep a minimum level of continuity, the conference chairs decided to keep the 
call for scientific papers open and to publish a selection of reviewed papers as pro- 
ceedings. The authors of accepted papers also provided a video presentation of their 
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contribution, which was made available via the conference’s website. In total, we 
received 36 submissions to the scientific call. 

In order to properly provide high-quality reviews, a Program Committee 
(PC) comprising of 131 members supported us in selecting the papers with the highest 
impact and scientific merit. For each submission, at least four reviews were written 
independently from the assigned reviewers in a single-blind review process (author 
names are visible to reviewers, reviewers stay anonymous). After all reviews were 
submitted, the PC chairs compared the reviews and discussed discrepancies and dif- 
ferent opinions with the reviewers to facilitate a meta-review and suggest a recom- 
mendation to accept or reject the paper. Overall, we accepted 8 papers which resulted in 
an acceptance rate of 22,2%. 

We thank all authors who submitted papers and the PC for providing careful reviews 
in a quick turnaround time. 
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The New DBpedia Release Cycle: 
Increasing Agility and Efficiency 
in Knowledge Extraction Workflows 


Marvin Hofer! ©&®, Sebastian Hellmann!, Milan Dojchinovski!?®, 
and Johannes Frey! 


! Knowledge Integration and Language Technologies (KILT/AKSW), 
DBpedia Association/InfAI, Leipzig University, Leipzig, Germany 
{hofer,hellmann,dojchinovski,frey}@informatik.uni-leipzig.de 
? Web Intelligence Research Group, FIT, Czech Technical University in Prague, 
Prague, Czech Republic 
milan.dojchinovski@fit.cvut.cz 


Abstract. Since its inception in 2007, DBpedia has been constantly 
releasing open data in RDF, extracted from various Wikimedia projects 
using a complex software system called the DBpedia Information Extrac- 
tion Framework (DIEF). For the past 12 years, the software received a 
plethora of extensions by the community, which positively affected the size 
and data quality. Due to the increase in size and complexity, the release 
process was facing huge delays (from 12 to 17 months cycle), thus impact- 
ing the agility of the development. In this paper, we describe the new 
DBpedia release cycle including our innovative release workflow, which 
allows development teams (in particular those who publish large, open 
data) to implement agile, cost-efficient processes and scale up productiv- 
ity. The DBpedia release workflow has been re-engineered, its new primary 
focus is on productivity and agility, to address the challenges of size and 
complexity. At the same time, quality is assured by implementing a com- 
prehensive testing methodology. We run an experimental evaluation and 
argue that the implemented measures increase agility and allow for cost- 
effective quality-control and debugging and thus achieve a higher level of 
maintainability. As a result, DBpedia now publishes regular (i.e. monthly) 
releases with over 21 billion triples with minimal publishing effort. 


Keywords: DBpedia - Knowledge extraction - Data publishing - 
Quality assurance 


1 Introduction 


Since its inception in 2007, the DBpedia project [8] has been continuously releas- 
ing large, open datasets, extracted from Wikimedia projects such as Wikipedia 
and Wikidata [15]. The data has been extracted using a complex software 


Sebastian Hellmann—https:/ /global.dbpedia.org/id/3eGWH. 
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system known as the DBpedia Information Extraction Framework (DIEF). Over 
the past years the system has received a plethora of extensions and fixes by the 
community which resulted in creating monolithic releases. 

Until 2017, The DBpedia release process has been primarily focused on data 
quality and size, however, it neglected other two important and desirable goals: 
productivity and agility (cf. [3] for balancing the magic triangle on quality, pro- 
ductivity and agility The release process was facing massive delays (from 12 to 
17 months) with increasing costs of development and lower productivity due to 
the sole focus on quality and the increased size and complexity. The releases 
were so large and complex that the DBpedia core team failed to produce them 
for almost three years (2017-2019). Note that this was not a performance nor 
scalability related issue. The DBpedia release workflow has been re-engineered, 
its new primary focus is on productivity and agility, to address the challenges 
of size and complexity. At the same time, the quality aspects are assured by 
implementing a comprehensive testing methodology. 

In this paper, we describe the new DBpedia release cycle including our inno- 
vative release workflow, which allows development teams (in particular those 
who publish large, open data) to implement agile, cost-effective processes and 
scale up productivity. As a result of our innovation DBpedia now produces over 
21 billion triples per month with minimal publishing effort. 

The paper is organized as follows. First, in Sect. 2, we summarize the two 
biggest challenges as a motivation for our work, followed by an overview of 
the release workflow described in Sect.3. The main process innovations and 
conceptual design principles are described in Sect. 4. The implemented testing 
methodology is described in Sect.5 and the results from several experiments 
showing the impact, capabilities and the gain from the new release cycle are 
presented in Sect. 6. Section 7 reports on technologies that relate to ours. Finally, 
Sect. 8 concludes the paper and presents future work directions. 


2 Background and Motivation 


1. Agility. Data quality is one of the largest and oldest topics in computer 
science independent of current trends such as Big Data or Knowledge Graphs 
and has a vast amount of facets to consider [16]. Data quality, often defined as 
“fitness for use”, poses many challenges that are frequently neglected or delayed 
in the software engineering process of applications until the very end, i.e. when 
the application is demonstrated to the end-user. In this paper, we will refer 
to this phase of the process as the “point-of-truth” since it marks an important 
transition of data (transferred between machines and software) to information. 
At this point, results are presented in a human-readable form so that humans 
can evaluate them according to their current knowledge and reasoning capacity. 
We argue that any delay or late manifestation of such a “point-of-truth” impacts 
cost-effectiveness of data quality management and stands in direct contradiction 
to the first and other principles of the agile manifesto: “Our highest priority is to 
satisfy the customer through early and continuous delivery of valuable software." 
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Fig. 1. The DBpedia release cycle. 


[2]. Our release cycle counteracts the delay by introducing frequent, fixed time- 
based releases in combination with automated delivery of data to applications 
via the DBpedia Databus (cf. Subsect. 4.1). 


2. Efficiency. We focus on efficiency as a major factor of productivity. Data 
quality follows the Law of Diminishing Returns [11] (similar to Pareto-Efficiency 
or 80/20 rule), meaning that initially decent quality can be achieved quickly, 
while complex errors become increasingly much harder to find and fix, up to a 
point where adding more resources (e.g. human labor or development power) 
produces similar or worse results!. In our experience, there is no exception 
to the law of diminishing returns in data. It affects all data projects, be 
they collaboratively edited such as Wikidata, semi-automatic such as DBpedia 
or fully automated machine learning approaches. Additionally, data quality 
does usually not depend primarily on the effort invested (e.g. by a 
large community) but on the efficiency of the development process 
and the ability to effectively improve data in a sustainable manner. 
Measures to increase efficiency are traceability of errors (Subsect. 4.2) combined 
with testing (Sect. 5). 


3 DBpedia Release Cycle Overview 


'The DBpedia release cycle is a time-driven release process triggered on a regular 
basis (i.e. monthly). The DIEF framework (in a distributed computational envi- 
ronment) is executed and data is extracted on the latest Wikipedia dump. The 
basis of the release cycle relies on the DBpedia Databus platform, which acts 
as a data publishing middleware and is responsible for maintaining information 
about published data by organizing collection of files as groups and artifacts. The 
DBpedia Databus is the core component which helps data publishers to publish 
and promote their data, additionally, it supports data consumers in searching 


1 Meaning overall less output per unit. 
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and retrieving data assets. The published file metadata is stored in the Databus 
repository and is accessible via SPARQL. 


Data Groups and Artifacts. The process creates five core data groups, each 
generated by a different extraction process”: i) generic-information extracted by 
the generic extractors, ii) mappings-information extracted using user specified 
mapping rules, iii) text-extracted Wikipedia article's content and iv) ontology— 
the DBpedia ontology and v) wikidata-extracted and mapped structured data 
from Wikidata [6]. Each data group consists of one or more versioned data 
artifacts which represent a particular dataset in different formats, content (e.g. 
language) and compression variants. In other words, an artifact is a collection of 
multiple files, which can be addressed with a unique Databus identifier. The arti- 
fact IRIs have hierarchical structure and follow a pattern. For example: https:// 
databus.dbpedia.org/dbpedia/mappings/instance-types/2020.04.01. 

Where ‘dbpedia’ refers to a publisher, ‘mappings’ refers to a group, ‘instance- 
types’ refers to an artifact and ‘2020.04.01’ refers to its version. 


Publishing Agents. A publishing agent acts on behalf of a person or orga- 
nization and publishes data on the Databus. A Databus account is created 
and assigned to each agent. The initial set of data groups are published on 
the Databus by the MARVIN publishing agent.? In addition to the MARVIN 
agent, there is also the DBpedia agent, which publishes cleaned data artifacts, 
ie. syntactically valid. The configuration files used to generate the MARVIN 
and DBpedia releases are available as a public git repository.* 


Cleansing, Validation and Reporting. The data (i.e., triples) published by 
the MARVIN agent is then picked up and parsed by the DBpedia agent to 
create strictly valid RDF triples without any violations (including warnings and 
errors) based on Apache Jena?. Finally, syntactically cleaned data artifacts are 
published by the DBpedia agent. While the data is syntactically valid, other 
data quality issues might persist. For example, the IRIs of particular subjects, 
predicates and objects do not conform to a predefined schema, the data can be 
structurally incorrect and does not conform to the ontology restrictions, or the 
release might be incomplete (e.g. missing artifacts). A large-scale validation is 
done for each release and the error reports are delivered to the community for a 
review. Figure 1 depicts the overall DBpedia data release cycle. 

The complete new DBpedia release approach has been deployed in February 
2020. Releases are created every month, except the text group, which is released 
every three months, due to its complexity We have deployed a light-weight 
dashboard (see http://release-dashboard.dbpedia.org/) which summarizes the 
releases, including the extraction process progress, extraction logs and overall 
statistics. Table 1 provides statistics for different DBpedia data groups for three 


? https: //databus.dbpedia.org/dbpedia/. 

3 https: //databus.dbpedia.org/marvin/. 

^ https: //git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config. 
5 https: //jena.apache.org. 
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Table 1. Size metrics (i.e. triples count) for DBpedia data groups and release periods. 


Version Generic Mappings Text Wikidata 
2016.10.01 | 4,524,347,267 | 730,221,071 | 9,282,719,317 | 738,329,191 
2019.08.30 | 4,109,424,594| 953,187,791 | - - 

2020.04.01 | 3,736,165,682 | 1,075,837,576 | 11,200,431,258 | 4,998 ,301,802 


releases; from Oct 2016,° Aug 2019" and Apr 2020.8 ‘2016.10.01’ is the last 
monolithic legacy release, which we added for comparability. Note that we do 
not provide numbers for ‘text’ and ‘wikidata’ data groups for the ‘2019.08.30’ 
due to the incompleteness of these releases. 

The numbers from Table 1 show that the amount of triples in the ‘mappings’, 
‘text’ and ‘wikidata’ data groups is constantly increasing over time. By contrast, 
the ‘generic’ data group provides less triples. This is primarily due to the strict 
testing procedures which have been put in place and as a consequence, invalid 
statements have been not included in the release. Note that the numbers are also 
impacted by the configuration of the DIEF system (e.g. enabled extractors) for 
different releases. Compared to the Wikidata statistic? the DBpedia ‘wikidata’ 
extraction produces five times the amount of statements published by itself, 
mainly because of reification and materialization processes during the extraction 
(e.g. transitive instance types). 


4 Conceptual Design Principles 


Two design principles have driven the design and implementation of the new 
DBpedia release cycle: i) time-driven data releases enable more frequent and 
regular DBpedia releases, and ii) traceability and issue management enables 
more efficient linking of issues with tests and tracking their causes. 


4.1 Time-Driven vs. Quality-Driven Data Releases 


While many of the principles of the agile manifesto are applicable, the most rele- 
vant principle *Working software is the primary measure of progress" [2] can not 
be applied directly to data. As motivated in Sect.2, the judgment of whether 
“data works" is withheld until the “point-of-truth” on the customer/end-user 
side. From our own past experience and from conversations with related devel- 
opment teams, it is a fallacy that the developer or data publisher has the capacity 
to evaluate when “data is useful", following their own quality-driven or feature- 
driven agenda. Since adopting an attitude of “quality creep”!° bears the risk of 


$ https://databus.dbpedia.org/vehnem/collections/dbpedia-2016- 10-01. 
T https: //databus.dbpedia.org /vehnem/collections/dbpedia-2019-08-30. 
8 https: //databus.dbpedia.org/vehnem /collections /dbpedia- 2020-04-01. 
? https: //tools.wmflabs.org/wikidata-todo/stats.php. 

10 Analogous to feature creep in software. 
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delaying releases and prevent data reaching end-users with valuable feedback, 
we decided to switch to a strict time-based schedule for releasing following these 
principles: 


1. Automated Schedule vs. Self-discipline. Releases are fully automated 
via the MARVIN extraction robot. This alleviates developers from the decision 
when “data is ready”. Else extensive testing of data might have an adverse 
effect. Developers are prone to “fixing one more bug” instead of delivering data 
for proper end-user feedback. 


2. Subordination of Software. The whole software development cycle is com- 
pletely subordinate to the data release cycle with time-driven, automatic check- 
out of the tested master branch. 


3. Automated Delivery. Data is published on the DBpedia Databus, which 
allows subscription for data (artifacts/versions/files), which in term enables 
auto-updated application deployment!! and therefore facilitating point-of-truth 
feedback opportunities earlier and continuously. 


4.2 Traceability and Issue Management 


Any data issues discovered at the point-of-truth start a costly process of back- 
tracking the error in reverse order of the pipeline that delivered it. The problem of 
tracing and fixing errors becomes even more complicated in Extract- Transform- 
Load (ETL) procedures where the data is heavily manipulated and/or aggre- 
gated from different sources. A quintessential ETL example is the DBpedia sys- 
tem, which implements sophisticated ETL procedures for extraction and refine- 
ment of data from semi-structured mixed-quality and crowd-sourced sources such 
as Wikipedia and Wikidata. Over the years, a huge community of users and 
contributors has formed around DBpedia, that are reporting errors via different 
communication channels such as Slack, Github and the DBpedia forum. A vast 
majority of the issues are associated with i) a piece of data and ii) a procedure 
(i.e. code) which has generated the data. In the past, the management of issues 
has been done in an ad-hoc manner. Recently, we introduced a systematic, test- 
driven approach for managing data and code-related issues using Linked Data. 
In order to enable more efficient traceability and management of issues, we have 
introduced two technical improvements: 


1. Explicit Association of Data Artifacts and Code. Previously DBpedia 
was grouped by language, which made backtracking difficult. Now every created 
and published data artifact is explicitly associated, due to a one-time manual 
mapping, with the procedure (i.e. code) which created the artifact. For example, 
the “instance-types” !? artifact is associated with the “MappingExtractor.scala” 
class which created the artifact (“View code" action on the Databus website) 
H yia Docker, out of scope for this paper, see https://wiki.dbpedia.org/develop/ 
datasets/latest-core-dataset-releases. 
12 https: //databus.dbpedia.org/dbpedia/mappings/instance-types/2020.04.01. 
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This allows for easier tracking of errors and relates data to code. A query!? on 
http://databus.dbpedia.org/sparql revealed that 26 code references exist and 12 
are still missing for the wikidata group. 


2. Semantic Pinpointing for Issue Management. A major difficulty for 
tackling data issues was to identify in which file and version the error occurred. 
Team-internal discussions as well as submitted community issues did not have the 
proper vocabulary to describe the datasets, exactly. Using Databus identifiers, 
these errors can be pinpointed to the exact artifact, version and file. 


Table 2. Testing methodology levels. 


Level Method Description 


Software JUnit Functional software tests on data parsers 
and extractor methods 


Constructs | Custom rules IRI patterns and encoding errors, 
datatype and literal conformity and 
vocabulary usage 


Syntax Syntax parsing Syntax parsing of output files 
implemented with Jena with customized 
selection of applicable errors and 
warnings 


Shapes SHACL A mix of auto-generated and custom 
SHACL test suites for domain and value 
range, cardinality and graph structure 


Integration | SPARQL over metadata | Verifies completeness of the releases and 
overall changes of quality metrics using 
Databus file/package metadata 
Consumer | SPARQL on graph Use case and domain specific SPARQL 
queries at consumer side. 

Point-of-truth evaluation 


3. Test-Driven Approach for Issue Management (Minidump). Testing 
was mostly done after publishing (post-release) and reported issues were often 
ignored as reproduction of the error were either untraceable or required a full 
extraction (weeks) and difficult manual intervention. We created a test suite 
library that can be executed post-release as well as on small-scale, extendable 
Wikipedia XML dump samples (collection of Wikipedia pages), producing a 
small release, i.e. a minidump. Tests on this minidump are executed on git push 
via continuous integration (minutes), thus enabling the following workflow: 1. 
for each reported data issue, a representative entity is chosen and added to the 
minidump. 2. a specific test at the appropriate level (see next section) is devised. 


13 https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/- /tree/master/ 
paper-supplement/codelink. 
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3. the code is improved so that the test passes. 4. post-release the same test is 
executed to check whether the fix was successful at larger scale, also testing for 
side-effects or breaking other parts of the software. 


5 Testing Methodology 


To cover the entire DBpedia knowledge management life cycle, from software 
development and debugging to release quality checks, we implemented a robust 
“Testing Methodology" divided into six different levels listed in Table2. The 
first level affects software development only. The following three levels (Con- 
structs, Syntax, and Shapes) are executed on the minidump as well as on the 
full releases. In comparison, the legacy extraction process did include tests but 
only covered the testing aspects of the Software and Syntax layers. The contin- 
uously updated developer wiki!^ explains in detail, which steps are necessary to 
1. add Construct and SHACL tests, 2. extend the minidumps with entities, 3. 
configure the Apache Jena-based parser and 4. run the tests and find related 
code. Besides the improvement in efficiency, the levels of testing were extended 
to cope with the variety of issues submitted to the DBpedia Issue tracker!>. 


trigger:dbpedia ontology a cv:IRI Trigger ; 
rdfs:label "DBpedia Ontology IRIs trigger" ; 
cvt:pattern  "^http://dbpedia.org/ontology/.*" . 
validator:dbpedia ontology a cv:IRI_Validator ; 
rdfs:label "DBpedia Ontology validator" ; 
cvv:oneO0fVocab <dbpedia/ontology/dbo-snapshots.nt> . 
<#dbpediaOntology> a cv:TestGenerator ; 
cv:trigger trigger:dbpedia_ontology ; 
cv:validator validator:dbpedia ontology . 


Listing 1: Test case covering the correct use of the DBpedia ontology. 


5.1 Construct Validation 


To investigate the layout and encoding conformity of produced data, we intro- 
duce an approach that focuses on the in-depth validation of its pre-syntactical 
constructs. This concept differs from Syntactical Validation, since it does not 
rely on the complete syntactical correctness of the analyzed data, but checks the 
conformity for its single constructs. A construct can be any character or byte 
sequence inside a data serialization, typically a specific part in the EBNF gram- 
mar [12]. In the case of RDF NTriples and DBpedia, interesting constructs are 
IRIs or literals represented by the subject, predicate, or object part of a single 
triple. Blank nodes are ignored as they follow unpredictable patterns. Moreover, 


14 http:/ /dev.dbpedia.org/Improve. DBpedia. 
15 bttps://github.com/dbpedia/extraction-framework/issues?q-is?63Aissue--is063A op 
en--labelV 263A ci- tests. 


The New DBpedia Release Cycle 9 


a single construct can be validated independently of inaccuracies in the rest of 
the data. This method can be used to gain better test coverage metrics over 
specific data parts, such as IRI patterns in RDF. 

Assessing layout quality of an IRI is motivated by: 


1. Linked Data HTTP requests are more lenient towards variation. RDF and 
SPARQL are strict and require exact match. Especially it is relevant that each 
release uses the exact same IRIs as before, which is normally not handled in 
syntactical parsing. 

2. optional percent-encoding, especially for international chars and gen/sub- 
delims! = 1, p, ues cot re te ee uen 

3. Valid IRIs with wrong namespace http://www.wikidata.org/entity /Q64 or 
https: //www.wikidata.org/wiki/Q64 or wrong layout (e.g. wkd:QQ64) 

4. Correct use of vocabulary and correct linking 


Complementary to Syntactical Validation, this approach provides a more fine- 
grained quality assessment methodology and can be specified as follows: 


Construct Test Trigger: A Construct Trigger describes a pattern (e.g., a 
regular expression or wildcard) that covers groups of constructs (i.e. namepsaces 
for IRIs) and assigns them to several domain-specific test cases. Moreover, if a 
trigger matches a given construct, then it triggers several validation methods 
that were assigned by a test generator. These patterns are highly flexible, and 
it is possible to define overlapping triggers. 


Construct Validator: To verify a group of triggered constructs, a Construct 
Validator describes a specific reusable test approach. Several conformity con- 
straints are currently implemented: regex - regular expression matching, oneOf 
- matching a static string, oneOfVocab - is contained in the ontology or vocab- 
ulary, and doesNotContain - does not contain a specific sequence. Further, we 
implemented generic RDF validators, based on Apache Jena, to test the syntac- 
tical correctness of single IRI and literal constructs. 


Construct Test Generator: A construct test generator defines an 1 : n relation 
between a Construct Trigger and several Construct Validators to describe a set 
of test cases. 

For our approach, it was convenient to use Apache Spark and line-based 
regular expressions on NTriples to fetc.h these specific constructs. Listing 1 
outlines an example construct test case specification covering DBpedia ontol- 
ogy IRIs, by checking the correct use of defined owl:Class, owl:Datatype, and 
owl:ObjectProperties. The Construct Validation approach seems theoretically 
extensible to validate namespaces, identifiers, attributes, layouts and encodings 
in other data formats like XML, CSV, JSON as well. However, we had no proper 
use case to justify the effort to explore it. 


16 https://tools.ietf.org/html/rfc3987. 
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5.2 Syntactical Validation 


The procedure of Syntax Validation verifies the conformity of a serialized data 
format with its defined grammar. Normally, RDF parsers distinguish between 
different levels of” syntactical correctness", including errors and warnings. Errors 
represent entirely fraudulent statements, in the sense of irreproducible informa- 
tion, and a warning refers to an incorrect format of e.g., a datatype literal. 

It is important to validate and clean the produced output of the DIEF, since 
some of the used methods are bloated, deprecated and erroneous. Therefore, 
the used Syntax Validation is configured to remove all statements containing 
warnings or errors. This guarantees better interoperability in the target soft- 
ware, which might use parsers considering some warnings as errors. The parser 
is a wrapper around Apache Jena, highly parallelized and is configured as fault- 
tolerant to skip erroneous triples and log exceptions correctly. The syntax clean- 
ing process produces strictly valid RDF NTriples, on the one hand, and gener- 
ates RDF syntax error reports, on the other. The original file is also kept on 
MARVIN to allow later inspection. The error reports provide structured input 
for community-driven and automated feedback. Finally, the valid NTriples are 
sorted to remove duplicated statements. This can later be utilized to compare 
iterations or modified versions of specific data releases. 


5.3 Shape Validation 


SHACL (Shapes Constraint Language)!” is a W3C Recommendation which 
defines a language for validating RDF graphs against a set of conditions. These 
conditions are provided as shapes and other constructs expressed in the form 
of an RDF graph. SHACL is used within DBpedia’s knowledge extraction and 
release process to validate and evaluate the results (i.e. generated RDF). The 
defined SHACL tests are executed against the extracted minidump results (Sub- 
sect. 4.2). 


<#Cesky_(rozcestnik)_cs> a sh:NodeShape ; 
sh:targetNode <http://cs.dbpedia.org/resource/Cesky_(rozcestnik)> ; 
# assuring that the disambiguation extractor for Czech is active 
# notice that for some languages the disambiguation extractor is not 
— active (e.g. the case Czech) 
sh:property [ 
sh:path dbo:wikiPageDisambiguates ; 
sh:hasValue <http://cs.dbpedia.org/resource/Cesky> ; 
] 


Listing 2: SHACL test for existence of Czech disambiguation links. 


17 Edited by D. Kontokostas, the former CTO of DBpedia: 
https:/ /www.w3.org/ TR/shacl/. 
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Motivating Example. Recently, the Czech DBpedia community identified that 
the disambiguation links have not been extracted for Czech. The lack was discov- 
ered by an application-specific integration test (next section). Upon fixing the 
problem (configuration-related), a SHACL test (Listing 5.3) was implemented 
which will in future detect non-existence of the “disambiguation links” dataset 
on commit by checking a representative triple. 


5.4 Integration Validation 


Since software and artifacts possess a high coherence and loose coupling, addi- 
tional methods are necessary to ensure overall quality control. To validate the 
completeness of a final DBpedia release, we run SPARQL queries on the Databus 
graph in order to check if all expected files are found. Listing 3 shows an example 
query to acquire an overview of the completeness of the mappings group releases 
on the DBpedia Databus.!? Other application-specific tests exists, e.g. DBpedia 
Spotlight needs 3 specific files to compute a language model.!? 


6 Experimental Evaluation 


Section 3 and Table 1 has already introduced and discussed the size of the new 
releases. For our experiments, we used the versions listed there and in addition 
the MARVIN pre-release. 


SELECT ?expected ?actual ?delta ?versionStr ?versionIRI 1 
{SELECT ?expected (COUNT(DISTINCT ?distribution) AS ?actual) 
((?actual-?expected) AS ?delta) ?versionIRI ?versionString { 
VALUES (?artifact ?expected) { 
( mapp:geo-coordinates-mappingbased 29 ) 
mapp:instance-types 80 ) 
mapp:mappingbased-literals 40 ) 
mapp:mappingbased-objects 120 ) 
mapp:specific-mappingbased-properties 40 ) 


IN NINN 


} 
Tdataset dataid:artifact ?artifact . 
Tdataset dataid:version ?versionIRI . 
?dataset dcat:distribution ?distribution . } 
} 
FILTER(?delta != 0) 
} ORDER BY DESC(?versionStr) 


Listing 3: SPARQL integration test comparing expected file counts of artifacts 
with the actually released number. 


18 All groups https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/- / 
tree/master/test. 

19 SPARQL query at https://forum.dbpedia.org/t/tasks-for-volunteers/163 Languages 
with missing redirects/disambiguations. 


12 M. Hofer et al. 


As a variety of methods (e.g. [7], a pre-cursor of SHACL) has been evaluated 
on DBpedia before and is not repeated here. We focused this evaluation on the 
novel Construct Validation, which introduce a whole previously invisible error 
class. Results are summarized, detailed reports will be linked to the Databus 
artifacts in the future. For this paper, they are archived here.?? 


Construct Validation Tests. To validate the constructs of the triples pro- 
duced by DIEF, we specified generic and custom domain-specific test cases. 
With respect to the constructs in Subsect. 5.1, we provide different test cases 
for IRI compliance and literal conformity to increase the test coverage over the 
extracted data. The IRI test cases focus on the encoding or layout of an IRI, 
and check the correct use of several vocabularies. In case of extracted DBpedia 
instance IRIs, the test cases validate the correctness considering that a DBpedia 
resource IRIs should not include sequences of ‘?’, 4£, ‘P, ‘]’, 121, "424, "526", 
‘G27’, 428, 7429, "42A, "2B", ‘%20, ‘43B, ‘43D’ inside the segment part and fol- 
lows Wikipedia conventions. The vocabulary test cases, which will be automated 
later, include tests for these schemas:?! dbo, foaf, geo, rdf, rdfs, xsd, itsrdf, 
and skos to ensure the use of the respective ontology or vocabulary specifica- 
tion. Further, generic IRI and literal test cases are implemented to test their 
syntactical correctness and to validate the lexical format of typed literals. The 
full collection of specified custom Construct Validation test cases is versioned at 
the DIEF git repository.?? 


Construct Validation Metrics. We define Construct Validation Metrics to 
measure the error rate and the overall test coverage for IRI patterns, encoding 
errors, datatype formats and vocabularies used in the produced data. The overall 
construct test coverage is defined by dividing the number of constructs that at 
least trigger one test by the total amount of found constructs. 
Coverage :— Triggered Constructs / Total Constructs 
The overall error rate (in percent) is determined by dividing the number of 
constructs that have at least one error by the total number of covered constructs. 
Error Rate := Erroneous Constructs / Covered Constructs 


Test Results. The custom tests for the DBpedia ‘generic’ and ‘mappings’ 
release have an average of 87% IRI coverage (cf. Table3). The test coverage 
can be increased by writing more custom test cases, but concerning the 80/20 
rule, this could result in high efforts and the missing IRI patterns are presumably 
used inside of homepage or external link relations. The new strict syntax clean- 
ing was introduced on the ‘2019.08.30’ version of the mappings release and later 
applied to the ‘generic’ release. It removes a significant amount of IRIs from the 
‘generic’ version (~500 million) and only a fraction from the ‘mappings’ release, 
reflecting the different extraction quality of them both. Although strict parsing 


20 https: / /git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/- /tree/master / 
paper-supplement/reports. 

?! bttp://prefix.cc. 

?? https://github.com/dbpedia/extraction-framework/blob/master/dump/src/test./ 
resources/dbpedia-specific-ci-tests.ttl. 
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Table 3. Custom Construct Validation test statistics of the DBpedia and MARVIN 
release for the generic and mappings group (Gr). Displaying the total IRI counts, the 
Construct Validation test coverage of IRIs, and construct errors (e.g. wrong IRI pattern 
or vocab usage) of certain Databus releases. 


Gr. Release | Version IRIs total Coverage | Errors Error rate 
Generic | DBpedia | 2016.10.01 | 12,228,978,594/83.93% | 15,823,204|0.15% 
MARVIN } 2019.08.30 | 11,089,492,791|90.98% — 18,113,408} 0.18% 
DBpedia | 2019.08.30 | 11,089,492,791|90.98% | 18,113,408} 0.18% 
MARVIN } 2020.04.01 | 10,527,299,298 |89.59% | 18,662,921|0.20% 
DBpedia | 2020.04.01 | 10,022,095,645 |89.32% | 18,652,958 |0.2196 
Mappings|DBpedia | 2016.10.01 2,058,288,765 | 84.01% 6,692,902 | 0.39% 
MARVIN |} 2019.08.30) 2,686,427,646 | 85.99% 6,951,976 | 0.30% 
DBpedia | 2019.08.30) 2,678,475,356 | 86.01% 6,875,930 | 0.30% 
MARVIN | 2020.04.01) 3,020,660,756 | 86.24% 7,514,376 | 0.29% 
DBpedia | 2020.04.01, 3,019,942,481 | 86.24% 7,505,332 | 0.29% 


was used and invalid triples are removed, the other errors remain, which we con- 
sider a good indicator that the Construct Validation is complementary to syntax 
parsing. 

Table 4 shows four independent Construct Validation test cases. 


XSD Date Literal (xdt). This generic triple test validates the correct for- 
mat use of xsd:date typed literals ("yyyy-mm-dd"^^xsd:date). Due to the use 
of strict syntax cleaning, as shown in Table4, subsequent release later than 
‘2016.10.01’ do not contain incorrectly formatted date type literals, loosing sev- 
eral million triples. Removing warnings leads to better interoperability later. 


RDF Language String (lang). The DIEF uses particular serialization 
methods to create triples that are often duplicated and contain deprecated 
code fragments. The post-processing module had an issue to build correct 
rdf:langString serializations by adding this IRI as explicit datatype instead 
of the language tag. Considering the N-Triples specification, this is an implicit 
literal datatype assigned by their language tags. This bug was not recognized by 
later parsers (i.e. Apache Jena), because the produced statements are syntacti- 
cally correct. Therefore, to cover this behavior we introduce a generic test case 
for this kind of literals. The prevalence of this test is described by the pattern 
"'*"^^rdf:langString!' and the test validation is defined by an assertion that 
the pattern should not exist. Moreover, if a construct can be tested, the test 
directly fails and so the prevalence of the test is equal to its errors. A post- 
processing bug fix was provided before the ‘2020.04.01’ release, and considering 
Table 4 was solved properly. 
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Table 4. Construct Validation results of the four test cases: XSD date literal (xdt), 
RDF language string (lang), DBpedia ontology (dbo) and DBpedia Instance URIs 
which contain a question mark (dbrq). We mention the total number of triggered 
constructs (prevalence), the aggregated amount of errors, and the percentile error rate. 


Gr. Test | Version Prevalence Errors Error rate 
DBp. generic xdt | 2016.10.01 | 32,104,814 4,419,311 13.77% 
2019.08.30 | 28,193,951 0 0% 
2020.04.01 | 26,193,711 0 0% 
lang | 2016.10.01 | 229,009,107 | 229,009,107 | 100% 
2019.08.30 | 353,220,047 | 353,220,047 | 100% 
2020.04.01 | 0 0 0% 
DBp. mappings | dbo | 2016.10.01 | 419,745,660 6,686,707 | 1.594% 
2019.08.30 | 496,841,363 | 6,857,202 1.381% 
2020.04.01 | 567,570,166 | 7,500,707 | 1.322% 
dbrq | 2016.10.01 | 853,927,831 |0 0% 
2019.08.30 | 1,198,382,078 | 15,407 0.001286% 
2020.04.01 | 1,354,209,107 | 0 096 


DBpedia Ontology URIs (dbo). To cover correct use of correct vocabularies, 
some ontology test cases are specified. For the DBpedia ontology this test is 
assigned to the *'http://dbpedia.org/ontology/*' namespace and checks for 
correctly used IRIs of the DBpedia ontology. The test demonstrates that the used 
DBpedia ontology instances used inside the three *mappings' release versions do 
not conform with the DBpedia ontology (cf. Table 4). By inspecting this in detail, 
we discovered the intensive production of a non-defined class dbo:Location, 
which is pending to be fixed. Error rate is lower in later releases, as size increased. 


DBpedia Instance URIs (dbrq). This test case checks one encoding criterion 
of extracted DBpedia resource IRIs. Therefore, if a construct matches *http:// 
[a-z\-]*.dbpedia.org/resource/*’ the last path segment is checked not to 
contain the ‘?’ symbol as this kind of IRIs should never carry a query part. As 
displayed in Table4, the incorrect extraction of the dbr IRIs considering the ‘?’ 
symbol occurred for version ‘2019.08.30’ and was then solved in later releases. 


Test Coverage of Non-DBpedia Datasets. To show the re-usability of the 
Construct Validation approach, we analyzed a set of external RDF datasets.?? 
For these datasets our custom test cases achieved an average coverage around 
10%. (cf. Table 5). The biggest part is covered by the custom vocabulary tests, 
especially foaf, rdf, rdfs and skos are commonly used across multiple RDF 
datasets. Another useful test case represents the correct use of DBpedia IRIs 
inside these datasets (inbound links). Almost in all external datasets, it could 
be recognized that backlinked DBpedia instances or ontology IRIs are wrong 


23 https://databus.dbpedia.org/vehnem/collections/construct-validation-input. 
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Table 5. Custom Construct Validation statistics and triple counts of external RDF 
NTriples releases on the Databus. Including IRI test coverage and the number of failed 
tests based on the custom DBpedia Construct Validation. 


Dataset Version Triples IRIs Total | Coverage | Errors Error rate 
CaliGraph [4] 2020.03.01 | 321,508,492 | 954,813,788 | 48.50% | 30,478,073 | 6.58% 
MusicBrainz [13] | 2017.12.01 163,162,562 | 443,227,356 | 12.58% | 23 0.00% 
GeoNames? 2018.03.11 | 176,672,577 | 468,026,653 | 10.86% | 321,865 0.63% 
DBkWik [5] 2019.09.02 127,944,902 | 322,866,512 | 6.91% |18 0.00% 
DNBP 2019.10.15 | 226,028,758 | 502,217,070 | 3.37% | 14 0.00% 


“https: //www.geonames.org 
bGerman National Library - https://www.dnb.de 


encoded or incorrectly used. In the case of RDF, this demonstrates that the 
introduced test approach can validate links between independently produced 
Linked Open datasets. 


Limitations. Coverage of Construct Validation. As demonstrated the Construct 
Validation can test for issues that are not covered by the Syntax or Shape Val- 
idation. But for fine-grained testing, to reach a 100% IRI test coverage on an 
extracted dataset, it is quite hard to define test cases for every used namespace 
and vocabulary, concerning their encoding and layouts (e.g., external links). 
Comparison of releases. The number of enabled extractors, produced artifacts, 
extracted languages, new tests, and mappings can change in newer releases. 
Therefore, it is challenging to compare evolving releases containing a different 
set of files and single files that provide more or fewer triples. 


7 Related Work 


At the conceptual level, our work is very related to the “Engineering Agile Big- 
Data” concepts described in [3] and inspired and based on those particular con- 
cepts. Below we discuss the related works to ours and primarily in respect to i) 
data release cycle and ti) data quality assessment. 


Data Release Cycle. The release processes for different knowledge bases are 
naturally different due to the different ways of obtaining the data. Wikidata, as 
the most related open data release project, releases dumps on a weekly basis and 
publishes them in an online file directory?^ without machine-readable descrip- 
tions. In comparison, DBpedia systematically releases data artifacts accompa- 
nied with machine-readable descriptions published on the DBpedia Databus plat- 
form. This enables data consumers to develop intelligent consumer agents which 
can easily find and retrieve relevant data artifacts. 

Besides Wikimedia, there are other open data release initiatives such as 
WordNet [9], BabelNet [10] and YAGO [14]. However, all these projects (with 


24 https:/ /dumps.wikimedia.org/wikidatawiki/entities/. 
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exception Wikidata) do not provide regular time-driven (e.g. monthly, bi-annual 
or annual) releases as DBpedia does. Their current release strategy is feature- 
driven and a new data version is released as soon as a new feature or extension 
has been implemented. This results in delayed and irregular releases. For exam- 
ple, the release of YAGO 4.0 (release in March 2020) took almost three years 
since the previous YAGO 3.1 release (in June 2017). Similarly, BabelNet?? per- 
forms feature-driven releases, with the latest BabelNet 4.0 release from Feb 2018 
and the previous 3.7 release from Aug 20106. 


Data Quality Assessment. Further, we briefly mention two projects that 
attempt Linked Data quality assessment by applying alternative facets. 

Due to the different nature, DBpedia implements software/minidump and 
large-scale validation mechanism. Wikidata performs validation using the Shape 
Expressions Language (ShEx)?° on top of the user generated input. 

TripleCheckMate [1] describes a crowd-sourced data quality assessment app- 
roach by producing manual error reports of whether a statement conforms to 
a resource or can be classified as a taxonomy-based vulnerability. Their results 
showed a broad overview of examined errors but were tied to high efforts and 
offered no integration concept for further fixing procedures. On the other hand, 
RDFUnit is a test-driven data-debugging framework that can run automatically 
generated and manually generated test cases (predecessor of SHACL) against 
RDF datasets [7]. These automatic test cases mostly concentrate on the schema, 
whether domain types, range values, or datatypes adhere correctly. The results 
are also provided in the form of aggregated test reports. 


8 Conclusion and Future Work 


In this paper, we presented and combined several approaches (including time- 
based, test-driven, and traceable development principles) for increasing the agility 
and efficiency of knowledge extraction workflows and demonstrated it in the 
case of the novel DBpedia release cycle. Considering that DBpedia is an enor- 
mous open source project, we introduced a new set of extensive test methods, to 
offer a convenient process for community-driven feedback and development. The 
DBpedia Databus is used as a quality control interface, due to the utilization 
of traceable metadata. The Construct Validation test approach provides a more 
in-depth issue tracking checking for wrong formatted datatypes, inconsistent use 
of vocabularies, and the layout or encoding of IRIs produced in the extracted 
data. In combination with Syntactical and Shape Validation, this covers a large 
spectrum of possible data flaws. Moreover, it was shown that the minidump- 
based and large-scale test concept provides a flexible view to directly link tests 
with existing issues. The described workflow builds a reliable and stable base for 
future DBpedia (or other quality-assured data) releases. However, we presented 
only a few specific examples of how testing and development of the release pro- 
cess is improved. Therefore, the full potential of how the testing methodologies 


?5 https:/ /babelnet.org/stats. 
?6 https:/ /shex.io. 
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increase agility and productivity can only be measured after their adoption by 
the community in the next years. As an overall result, the new DBpedia release 
cycle produces over 21 billion triples per month with minimal publishing effort. 
As future work, we will link all created evaluation reports to Databus artifacts, 
similar to the explained code references (cf. Subsect. 4.2). Further, we plan to 
extend the usability of the release dashboard. 
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Abstract. While thousands of ontologies exist on the web, a unified sys- 
tem for handling online ontologies — in particular with respect to discov- 
ery, versioning, access, quality-control, mappings — has not yet surfaced 
and users of ontologies struggle with many challenges. In this paper, we 
present an online ontology interface and augmented archive called DBpe- 
dia Archivo, that discovers, crawls, versions and archives ontologies on 
the DBpedia Databus. Based on this versioned crawl, different features, 
quality measures and, if possible, fixes are deployed to handle and sta- 
bilize the changes in the found ontologies at web-scale. A comparison to 
existing approaches and ontology repositories is given. 


Keywords: Ontology archive - Ontology repository - Ontology 
crawling 


1 Introduction 

Phrases such as “A little semantics goes a long way”! or “Let a thousand ontolo- 
gies blossom” [7] have shaped the landscape of ontologies on the Semantic Web. 
Ontologies are the common language spoken on the Semantic Web, they repre- 
sent schema knowledge and provide a common point of integration and reference 
while the value of an ontology grows with its use. As the conceptual framework 
to globally interlink distributed knowledge, ontologies provide the backbone of 
the Semantic Web. 

While thousands of ontologies exist on the web, a unified system for han- 
dling online ontologies has not yet surfaced and both publishers and users of 
ontologies struggle with many uncertainties and challenges. The main discussion 
and effort so far in the Semantic Web community is unbalanced and focused on 
authoring and publication of ontologies and linked data in general with serious 


1 http://www.cs.rpi.edu/~hendler/LittleSemanticsWeb.html. 
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consequences. The community produced several guidelines, rules, methodolo- 
gies and tooling for publishers neglecting users and clients. However, the variety 
increases uncertainty by offering too many choices, increases effort and com- 
plexity through the need to understand and implement several guidelines and 
provides no or unclear incentives or rewards to the publisher to comply with 
them. 

As a consequence, the consumer is left to deal with the resulting hetero- 
geneity, quality issues and failures. The majority of problems and challenges fall 
into the categories access and quality. We have identified several Usage Chal- 
lenges which we enumerate in parentheses for reference in the remainder of 
the paper. Major physical access problems are caused by link rot (UC1) and 
incorrect Linked Data deployments (UC2), but most crucially there is no estab- 
lished, stable citation or dependency system for ontologies like Maven or DOI 
(UC3) - ontologies or parts of it can change or disappear anytime. Addition- 
ally, heterogeneity increases the complexity to access ontologies. There can be 
no, unclear or inconsistent versioning (UC4a), the versioning nomenclature can 
substantially vary (UC4b) and guarantees w.r.t. backward-compatibility usu- 
ally remain unclear (UC4c). Various formats to serialize OWL ontologies exist 
(e.g. OWL-XML, RDF-XML, Manchester Syntax, Turtle; UC5). In case that an 
application/consumer succeeded to retrieve an ontology (version) several qual- 
ity problems can prevent proper processing /usage. Parsing of the RDF snapshot 
can fail (UC6), problems w.r.t. licensing can prevent the usage at all (UC7) due 
to missing, unclear, heterogeneous (several properties and license IDs) or too 
restrictive or improper licensing. Finally, the fitness for use can be limited due 
to low quality metadata (e.g. missing labels, title; UC8) or logical inconsistencies 
(UC9). 

In this paper, we present a web-scale ontology interface called DBpedia 
Archivo (acronym for ontology archive), that discovers, crawls and versions 
ontologies and archives as well as augments them on the DBpedia Databus [5]. 
The primary purpose of this interface is to help users/consumers to discover, 
access and validate/assess the quality/usability of ontologies in a unified way, 
while reducing the challenges and effort to spot and deal with the mentioned 
issues, such that they can focus on building stable and reliable applications. 
Nevertheless, we also aim to support both the consumer and the publisher by 
augmenting the ontology (e.g. reporting quality metrics, generating documenta- 
tion). We envision in the mid/long-term, that with the help of Archivo we foster 
the adherence to standards (publicly showing issues, basic quality control for 
access to Archivo) and strengthening incentives for publishers (bad metadata 
e.g. no dct:title, dct:description results in worse findability and presentation in 
Archivo), such that the overall quality of the ontologies in the Web of Data 
emerges, which in return would benefit users and applications. 

We argue, that a crucial factor for the success of the web were working web 
browsers and search engines that increased user numbers and views and created 
incentives to publish correct and high quality websites. Following this line, as a 
novel paradigm, DBpedia Archivo (see Fig. 1) proposes a consumer/application- 
oriented approach to the Semantic Web. 
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Decentral publication paradigm 
. e Ontologies published on the Web Web / Linked Data Publishing Ontologies consumed from the Web 
e. reliable, persistent hosting ? e discovery ? 
e technical correctness ? e versioning ? 
« FAIR? Other guidelines ? » homogenization ? 
« Sufficient effort (docu, validation)? 
« Effort - benefit ratio ? e upstream patches ? 


* * 


Ontology co-evolution with Archivo E DBpedia Archivo Ontologies consumed from Archivo 


e error handling, cleanup ? 


> 


Fig. 1. Interface and platform model of Archivo 


At a glance, with DBpedia Archivo we make the following contributions: 


1. Discovery (including user suggestions), crawling, versioning, archiving and 
evaluation of ontologies with a high degree of homogenization and 
automation, 

2. unified, stable, referenceable identifiers for each ontology version, so 
that ontology consumption becomes stable and applications, experiments and 
research with a specific version of an ontology, can be reproduced at any time, 

3. unified time-based and Semantic Versioning enabling auto update applica- 
tions with custom trade-off between latest changes and stability (user con- 
trolled up-to-dateness), 

4. the augmented archive includes add-ins and extensions which enhance the use 
of an ontology, among others, generated documentation, quality reporting 
with a consumer-oriented star rating and results of validation and test steps. 


In the following section we provide an overview on related work. In the sub- 
sequent section we briefly introduce the conceptual ideas of Archivo and its 
platform model. Sect.4 describes the implementation. In Sect. 5 we introduce 
an automatically verifyable consumer rating. An evaluation of an initial crawl 
of ontologies based on our rating as well as a comparison of Archivo to existing 
ontology repositories is given in Sect. 6. 


2 Related Work 


Related work can be separated into three areas: archiving and versioning tools 
for ontologies, ontology repositories (which are compared in depth to Archivo in 
Sect.6) and ontology validation and testing tools. 
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2.1 Archiving and Versioning 


The Memento protocol [19] allows to discover and browse old versions (Memen- 
tos) of web resources. The Internet Archive provides a prominent service, Way- 
backMachine,? a generic archive for web resources (including a subset of ontolo- 
gies from the web) accessible using Memento. Moreover, Memento is used and 
adapted by the TailR system [11], a selfdeploy/service archiving system for 
Linked Data resources and the Triple Pattern Fragment Server which can 
be used to serve and query archived Linked Data [21] with lower infrastructural 
efforts. Unfortunately, Memento is currently not (widely) adopted for ontology 
publication and to the best of our knowledge, there is no support for Memento in 
ontology tools, yet. Archivo offers with SPARQL and Linked Data well-known, 
standardized and with the help of DataID metadata a unified way to discover, 
access but also query (relevant) versions of an ontology but additionally serves 
as a central point to discover (archived) ontologies. Realization of Memento on 
top of Archivo is possible and subject to future work. 

Sem Version [22] proposes a methodology and Java API for RDF (and ontol- 
ogy) versioning inspired by CVS. It offers a structural and a form of semantic 
diff between two versions, achieved by performing structural diffs on semantic 
closures (RDF(S) entailment). The semantic diff of Archivo based on (OWL) 
axiom diffs goes a step further. Quit [2] implements an RDF versioning and 
collaboration system on top of Git. It provides unified access via SPARQL 1.1 
on each version of an ontology and the versioning history. Both systems focus 
on ontology development rather than the consumer perspective. 

D2V is a tool to manage and visualize user-defined changes in RDF data. In 
[17] it is demonstrated for ontology evolution measuring specific types of changes 
(e.g. added properties / labels or deprecated classes). While D2V allows very flex- 
ible, use-case/dataset specific-analysis of changes, Archivo's additional Semantic 
Versioning aims at making the trade-off between unified and flexible/fine-grained 
change reports with 3 types of changes (major, minor, patch). 

Vocol [6] is an integrated environment based on Git and several services to 
enable collaborative vocabulary development. The workflow consists of 3 activi- 
ties: modeling, population and testing (syntactic and semantic validation, com- 
petency questions), deployment of ontologies (machine- and human-readable). 
While some of the features (semantic diff and validation, documentation genera- 
tion, custom tests for ontologies) are similar to Archivo, Vocol was designed for 
publishers, consumers depend on them to take advantage of the system. 


2.2 Ontology Repositories and Platforms 


There have been ample efforts to provide a platform, repository, library or other 
web services to deal with storage, search, retrieval of ontologies, some of which 
do not exist or work properly anymore. For reasons of brevity, we only mention 
approaches which are, to the best of our knowledge, still active and functional. 
We refer the reader to [4] for a time travel to a decade ago. 


? https: //archive.org/web/. 
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In our scope we identify 4 major characteristics of such systems. An archive 
persists ontologies (and its versions). A catalog associates a list of ontologies 
with thorough metadata. As index we denote a system that allows to search 
components (e.g. classes) of ontologies. A development platform is a workspace 
with integrated tools to create and handle ontologies. 

OntoMaven [13] is a distributed ontology archiving approach based on 
the maven philosophy. Ontologies and its dependencies are organized in mvn 
artifacts. As a consequence transitive imports can be resolved and downloaded 
locally. A set of mvn plugins supports several aspects of ontology development 
lifecycles, e.g. import, creation of documentation and reports, consistency tests 
and versioning. Although we were not able to find any announced public repos- 
itory, the ontology organization structure is very similar to the one of DBpedia 
Databus [5] Archivo is based on. 

OBOFoundry [18] is an ontology developer initiative in the biological and 
biomedical domain which manually curates a catalog of approved ontologies. The 
registering of new ontologies follows a set of design principles (e.g. naming con- 
vention, versioning strategy) which are verified semi-automatically. The foundry 
operates its own PURL service to offer stable identifiers. 

BioPortal [23] is another prominent catalog in the biomedical domain. It 
offers storage for ontology submissions and archiving to registered users and per- 
forms indexing on the latest submission. Moreover, it offers developer platform 
features such as user access rights and mappings between ontologies. 

Linked Open Vocabularies [20] (LOV) is a semi-automatically curated cata- 
log of vocabularies. It offers a search index on the terms defined in the vocabu- 
laries, a SPARQL Query endpoint and provides persistent access to the history 
of vocabularies. New vocabularies are discovered by analyzing (re)use of terms 
from archived ontologies or can be suggested by users. 

Ontobee [12] creates an index for OBOFoundry and a portion of other 
ontologies. It serves the ontologies as linked data and provides search and brows- 
ing interfaces. Another index in the biomedical domain is the Ontology Lookup 
Service [10]. 

OntoHub [3] is an open ontology repository engine with versioning based 
on Git following Open Ontology Repository Initiative (OOR) requirements. It 
offers homogeneous formal representation of ontology axioms using DOL, testing 
with HETS and competency questions. An instance of it operates ontohub.org 
which is free to users and contains a plethora of ontologies, including imports 
from other repositories. 


2.8 Ontology Evaluation and Validation 


The list of literature with rules and guidelines to follow is extensive. We would 
like to list [9,15,24], the LD principles,? LOD Cloud, LOV? and refer to their 


3 https: //www.w3.org/DesignIssues/LinkedData.html. 
* https: //lod-cloud.net /. 
5 https://lov.linkeddata.es/Recommendations. Vocabulary. Design.pdf. 
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references for brevity. We picked the prominent Ontology Pitfall Scanner! 
(OoPS!) [16], also used by Archivo, as a representative for the many existing 
validation & evaluation approaches as it provides an excellent overview of other 
literature. OnToology [1] is a service (based on OOPS and other tools) to 
create pull request for ontologies hosted on GitHub to deliver test reports and 
documentation. It is similar to the ontology augmentation concept of Archivo, 
however needs to be configured and managed by the repository owner/publisher. 

ROBOT [8] deserves a special mention as a highly automatized and con- 
figurable evaluator. The idea here is that sub-communities for certain domains 
(e.g. biological and -medical) configure and deploy the tool for their commu- 
nity. While similar (configure local needs, deploy local), Archivo follows a more 
generic approach (configure local needs, deploy global). 


3 Archivo Platform Model 


3.1 Versioning and Persistence on the Databus 


DBpedia Archivo is built on top of the DBpedia Databus [5], which is inspired 
by Maven Central Repository. It uses the maven concepts publisher /group/ar- 
tifact/version and ports them to a Linked Data platform, in order to manage 
data pipelines and enable automatic publishing and consumption of data. 

Archivo is a dedicated publishing agent on the Databus. Similar to [13] arti- 
fact IDs (represented as IRIs) are used as stable identifiers to reference an ontol- 
ogy with no regard to its evolution (UC1 and UC3). A version string appended to 
the artifact IRI forms a stable ID to resolve a particular version. An extension of 
the DataID metadata vocabulary for artifact, version, and files allows for flexible 
and fine-grained access using SPARQL. The concepts of time-based (UC4a/b) 
and semantic versioning (UC4c) support increased stability of applications while 
allowing automatic updates to some (user-configurable) degree. 

Databus file identifiers form a stable abstraction layer independent of hosting 
and similar to PURL by using dcat : downloadURL links in the metadata. Crawled 
ontologies and metadata are persisted on the DBpedia download server’. Cre- 
ating a mirrored archive of ontology versions such as Archivo is, of course, not 
infallible. We consider it, however, a sufficiently reliable fall-back to improve 
persistence of ontologies on the Semantic Web. 


3.2 Evaluation Plugins and SHACL Library 


DBpedia Archivo largely builds on the W3C SHACL? standard. While minimal 
basic validation as described in Sect.5 is fixed (part SHACL, part code), the 
remaining validation is done via a SHACL library that is partitioned into SHACL 


ê https:/ /databus.dbpedia.org/ontologies. 

T 13 years in existence, backed up by libraries (TIB) and universities (Mannheim) who 
are DBpedia Association members. 

8 https:/ /www.w3.org/'TR/shacl/. 
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test suites for specific purposes: 1) they can encode general validation rules (e.g. 
from OOPS and tackle UC7), 2) they can capture specific requirements needed 
by Archivo features such as the automatic HTML documentation generation 
of LODE (UC8) (cf. next section), 3) they can be sub-community or use case- 
specific down to individual user projects. While at the time of writing few SHACL 
test suites exist, we allow online contribution and extension (Validation as a 
Platform) for Archivo to run in the hope to give consumers a central place to 
encode their requirements and also discuss and agree on more universal ones. 


3.3 Feature Plugins 


Feature plugins in DBpedia Archivo augment a certain aspect of the ontology, 
e.g. generate documentation, visualization or automatic mappings. While a com- 
plete overview is out of scope of this paper, we integrated the Live OWL Doc- 
umentation Environment (LODE) [14] into Archivo, which generates a uniform 
HTML documentation for each version of all archived ontologies. Adding more 
features is straightforward. Pre-generated results make them universally avail- 
able for all ontologies and absolve publishers and consumers to find, learn and 
deploy such ontology tools. 


4 Archivo Implementation 


The guiding principle for Archivo’s implementation follows Jon Postel’s law: “Be 
conservative in what you do, be liberal in what you accept from others”. Being 
“liberal” in the context of Archivo has clear limits. While we accept ontologies 
in different formats, work around small mistakes (e.g. also recognizing incorrect 
dc:license triples instead of dct:license) (UC8) and even use recovering parsers 
that can skip syntax errors (UC6), we decided to be strict in all aspects that 
directly contradict the automatic processing of ontologies and therefore either 
heavily impact their usefulness or require meticulous archaeological excavation 
work to use and archive them. Since we invested the time to implement the 
most common retrieval and processing methods, our guideline is “If DBpedia 
Archivo can not process it in an automatic and deterministic man- 
ner, it is likely infeasible to be processed” based on the assumption that 
the Semantic Web was created for machines. One prominent example here is 
the missing license declaration in the FOAF RDF/XML document,?. While the 
HTML documentation includes the license using RDFa,!° it only yielded 348 
triples, compared to 631 in RDF/XML. While staying "liberal", there is no 
optimal automatic choice on what to accept: half the ontology with license, full 
ontology without license. Our strategy is that we are liberal at the launch of 
Archivo to allow old/unmaintained (but potentially already widely used) ontol- 
ogy versions to be archived but we will become more restrictive (no archiving of 


? http://xmlns.com/foaf/spec/ Supplement: https://github.com/dbpedia/ Archivo/ 
tree/master/paper-supplement. 
1? The subject of the license statement is the HTML document. 
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new ontology (versions) that do not fulfill baseline criteria) after an establishing 
phase. The strictness in such cases stems from the rationale that these non- 
automatic and non-deterministic ontologies will eventually cause an immea- 
surable and unacceptable amount of effort in the downstream network 
of consumers. 


4.1 Ontology Discovery and Indexing 
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Fig. 2. Overview of iterative ontology discovery and archiving 


The goal of the discovery and indexing phase is to create a distinct set (index) 
of non-information URIs/resource (NIR) of ontologies for each iteration as input 
for further crawling and processing. We devised four generic approaches to feed 
Archivo with ontology candidates (crawling candidate IRIs) and implemented 
them as a proof-of-concept. 


Ontology Repositories: One straightforward way of retrieving ontology URIs 
is by querying already existing ontology repositories. The repository with the 
broadest collection of very popular ontologies of the Linked Open Data Cloud 
is Linked Open Vocabularies (LOV) [20], which we used in this paper. LOV 
provides a simple API which contains (among other metadata) candidates for 
non-information URIs. 


Vocabulary Usage Analysis via VoID: Another approach to discover ontol- 
ogy candidates is by analyzing vocabulary usage in the data. Our goal here is in 
particular to cover all vocabularies used by datasets uploaded onto the Databus, 
which already contains several datasets besides DBpedia, such as Geonames, 
Caligraph, MusicBrainz and the German National Library, just to name a few. 
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As the Databus provides a controlled and harmonized environment, we generate 
a virtual class-based and property-based partition!! for all RDF files on the bus, 
thus retrieving a list of all classes and properties. 


Discovery via Links to External Ontologies: As Archivo already creates a 
controlled and harmonized ontology archive, we can exploit the refined collection 
of ontologies from the previous iteration to discover further ontology candidates. 
For this purpose, we extract a list of all subject, predicate and object IRIs from 
the ontologies itself to create more leads to properties /classes/ontology files. 


Manual Suggestion: Automatic discovery is able to capture and persist most 
of the currently available ontologies in a forward-progressing manner. In addi- 
tion manual/external suggestions of ontology candidate IRIs are accepted via 
web form!” to increase Archivo’s coverage and to offer an on-demand archiving 
function (UC3). Moreover, we consider this feature helpful for ontology engineers 
to test and receive feedback already during the development phase. 

Subsequent to the aforementioned discovery steps we crawl/check every can- 
didate IRI. The best effort crawling tries to download multiple RDF files via 
different HTTP-accept headers (in case a robots.txt is not disallowing access for 
the Archivo crawler) (UC2 and UC5). At the time of writing two additional 
rules are in place for considering an ontology/vocabulary as valid candidate for 
inclusion into Archivo: 1) the NIR needs to resolve to an RDF document rapper 
can read, 2) we require the existence of an entity identified by the NIR which is 
typed as owl: Ontology or skos : ConceptScheme (which should carry additional 
metadata and makes the ontology spottable in reliable way) in the triples out- 
put of the failure-tolerant parser. If multiple valid serialization candidates exist, 
we give preference to the serialization having the highest triple count (this will 
archive the correct FOAF version without license). Finally, the NIR is appended 
to the index and the chosen serialization is passed over for a release on the 
Databus. If the spotted NIR doesn’t match with the candidate IRI it started 
with, the retrieved NIR becomes a new NIR and the process starts again (see 
Fig. 2). The crawling candidate IRIs representing properties and classes with a 
slash URI scheme require a special treatment in case the resolution does not 
return the ontology itself. We use skos:inScheme and rdfs:isDefinedBy as 
pointers to a new candidate IRI. 


4.2 Analysis, Plugins and Release 


Analysis and Integration of Feature Plugins: In every new snapshot, we 
augment the original ontology file with a parsed ntriples, turtle and owl 
version to simplify the access (UC5 and UC6). Additionally, to the plugins and 
validation methods described in Sect. 3, the reasoner Pellet!? is used for checking 
the consistency (UC9) of the ontology and determining the OWL profile. Fur- 
thermore an OOPS report (UC8) is generated to detect common pitfalls of the 


11 ef. Sect. 4.5 of VoID: https: //www.w3.org/TR/void/#class-property-partitions. 
1? http:/ /archivo.dbpedia.org/add. 
13 https://github.com/stardog-union/pellet. 
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ontology. All reports are stored alongside the original snapshot with appropriate 
DataID metadata to augment the snapshot. 


Release on the Databus: To deploy an ontology on the Databus we use its non- 
information URI as the basis for the Databus identification. The host information 
of the ontology’s URI serves as the groupId and the path serves as the name 
for the artifactId. Archivo’s lookup component!^ with Linked Data interface 
allows to resolve the mapping from a non-information URI to the stable and 
persistent Databus identifier. 


4.3 Versioning and Persistence 


Time-Based Snapshots: For all verified non-information URIs in the index, 
Archivo looks for new versions a few times each day. To reduce the amount 
of transferred data, Archivo uses the HTTP-headers E-Tag, Last-Modified 
and content-length to detect via a HEAD-request if the respective ontol- 
ogy resource could have changed. If any of the headers changed (or if none 
of the headers is available), the vocabulary is downloaded and checked locally 
for changes. 

The local diff is performed by converting the downloaded source with rap- 
per!’ to canonical N-Triples, sorting them and comparing them with comm! 
to determine if any triple was added or deleted. This process requires the new 
version to be parseable without errors. In case a change could be verified the 
new snapshot is released with using the fetch timestamp as version label. 


Semantic Versioning: If a change in the set of triples was detected, a set of 
(description) logic axioms is generated for both the old and new version of the 
ontology and those axioms are compared to each other. In case of no changes 
in the axioms, no structural ontology change was done (e.g. added only labels, 
or ontology metadata) the change is classified as patch. If only new axioms 
were added, we consider this as a new minor version. If new classes/properties 
are added, this usually leads to no backward-compatibility problems for existing 
applications, but there are cases (e.g. adding a deprecated or disjoint relation to 
a class) which might have consequences in combination with A-boxes. Any dele- 
tion of already existing axioms (thus including renaming) is considered as major 
change potentially seriously affecting backward-compatibility. This semantic ver- 
sioning “overlay” allows a more fine-grained update decision than the binary 
“take it or leave it” (UC4a-c). Users can refine the trade-off with custom solu- 
tions based on the semantic versioning and axiom diffs. We plan that more 
sophisticated versioning overlays can augment the Archivo snapshots with open 
contributions via Databus mods (see Sect. 7). 


14 http://archivo.dbpedia.org/info?o—. 
15 http://librdf.org/raptor/rapper.html. 
16 https:/ /linux.die.net/man/1/comm. 
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5 A Consumer-Oriented Ontology Star Rating 


Following the argumentation of Sect. 4 our proposed rating system is “liberal” to 
a certain degree of heterogeneity, but strict in the sense that it awards low rat- 
ings to ontologies that defy automatic or deterministic processing. The proposed 
star rating differs from written rules and guidelines in human language in these 
aspects: 1) stars are formalized and algorithmically verifiable and can be tested, 
2) they are executed over the known, ontological part of the Semantic Web cap- 
tured in Archivo and are meant to be delivered to consumers to quickly assess 
the technical usability and soundness 3) they are centrally available, frequently 
executed, debatable and extendable. They allow capturing and crowd-sourcing 
of consumer needs. We included short references to other approaches from [16]'” 
(integrated, see below), [8]!? and [9] (VocUse, partly applicable). From DBpedia 
Archivo perspective, some requirements become redundant such as the HTML 
documentation, which can be generated, if the appropriate SHACL test is suc- 
cessful. Others become more strict (machine readability). 


5.1 Two Star Baseline 


We consider the two star baseline as a minimal requirement for considering the 
ontology as a legit participant in the Semantic Web. An ontology which does 
not fulfill the baseline can’t earn any further stars. 


x Retrieval and Parsing: All of the following criteria have to be fulfilled: 
(1) The non-information URI resolves to a machine readable format or a 
machine readable version is deterministically discoverable by other common 
means, (2) download was successful, (3) uses a common format implemented 
by Archivo, (4) at least one format was found that parses with no or few 
(negligible) syntactical warnings (UC6). [OBO fp2, OOPS! P37, VocUse 2] 

x License I/?: A proper ontology declaration was found using a owl: Ontology 
and some form of license could be detected. A high degree of heterogeneity 
is permissible for this star regarding the used property/subproperty as well 
as object: license URI (resolvable linked data or web link), xsd:string or 
xsd:anyURI (UC7). [OBO fp1, OOPS! P38 P41, VocUse 4] 


5.2 Quality Stars 


On top of the two star baseline, Archivo implements additional criteria. The 
main rationale behind these stars is to ease effort for client implementations by 
homogenizing the retrieved data and the technical expectations a client can have 
towards mirrored ontologies by Archivo. 


17 http://oops.linkeddata.es/catalogue.jsp. 

18 OBO, http://obofoundry.org/principles/fp-000-summary.html, link to automated 
checks. 

19 SHACL test  https://github.com/dbpedia/Archivo/blob/master/shacl-library / 
license-I.ttl. 
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x License II: We require a homogenized license declaration using dct: license 
as object property with a URI (not string or anyURI). If a resolvable Linked 
Data URI is used, we expect the URI to match the URI used in the machine 
readable license (UC7). We discovered many irregularities such as trailing ‘/’ 
which violate RDF requirements that URIs need to be exactly the same in 
RDF as opposed to Linked Data resolution. In the future, we plan to tighten 
up this criterion and expect machine readable license, which we will collect 
on the DBpedia Databus in a similar manner as Archivo. [OBO fp1, OOPS! 
P41, VocUse 4] 

x Logical Fitness: Although logical requirements such as consistency are the- 
oretically well-defined, from a consumer perspective this star is highly im- 
plementation-specific. We measure the compatibility with currently available 
reasoners such as Pellet/Stardog (more to follow) and run available tasks 
such as consistency checks (UC9), classification, etc. since ow1:disjointWith 
axioms are nice, unless they render the ontology unusable for reasoning. 
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Fig. 3. Distribution of violations per ontology using SHACL-based LODE tests 


Table 1. Results for Archivo (July 2020) testing and rating 


735 11/453/10/134/127 | 275/460/0 | 137/598/0 | 687/23/25 1/30/702/2 | 103/91/9/29/15/488 


Ll Format: 0/1/2/3/4 Stars Format: True/False/Error ?Format: OK/Warnings/Violations/Error 
^Format: OWL2 FULL/DL/QL/EL/RL/Tool Error 


5.3 Further Stars and Ratings 


We practiced a large amount of self-discipline not to encode more stars with our 
ideas and opinions as they didn't pass our own relevancy criteria (Who needs 
this?). Further stars and ratings could provide direct incentives for ontology 
publishers such as the ability to generate HTML documentation with LODE 
(tested with SHACL) or represent user needs, or could be of analytical nature, 
such as adoption and re-usage (inbound links from other ontologies and data, 
[9] VocUse 3 and 5). 
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6 Evaluation 


6.1 Archivo and Rating Statistics 


DBpedia Archivo consists of 735 ontologies in July 2020. The biggest fraction of 
it (401) was discovered via the LOV-API, 268 were discovered from prefix.cc and 
the rest was retrieved from the subjects, predicates and objects of the ontologies 
in Archivo itself (60) and user suggestions (6). Unfortunately the Usage Anal- 
ysis via VOID didn't yield any new ontologies, but this feature was added at 
last, so the index already contained the used ontologies of datasets from the 
Databus. Figure3 shows the ratio of ontologies that share a class of violations 
numbers. The diagram shows that, even though a small amount of ontologies are 
quite badly curated, the biggest share of ontologies has quite low error numbers, 
allowing a smooth generation of LODE documentation. Table 1 shows that more 
than 6096 of the ontologies have less than two stars. Almost every one star rating 
is caused by a missing license. Since an open license is a fundamental require- 
ment of open data, it is a bad sign for the usability of the available ontologies 
on the web. With more than 9096 of logical consistency the ontologies are sitting 
pretty, but as mentioned this value can be highly implementation specific. 


6.2 System Comparison 


We identified 7 other (ontology repository) systems which are either very similar 
on a conceptual or technical level (e.g. LOV, OntoMaven) or are active systems 
which serve a notable set of ontologies to users. While the type and primary 
usage of the systems vary, we assessed them under a common set of features 
along the 4 dimensions coverage, recency, access and quality (see Table 2). While 
access and quality dimensions stem from the problem analysis, a sound strategy 
for both a high coverage and recency w.r.t. archived ontologies seem natural 
requirements from the perspective of users and tools demanding for one unified 
solution to efficiently tackle the problems. We argue that such a system needs to 
offer and be built on a high level of automation and homogenization (unified and 
standardized/well known practices) to successfully tackle web-scale dimensions 
and (if done correctly) optimize client side processes (decreased consumer side 
effort and increased usage benefits). We selected features reflecting this. 
Archivo is the only system offering a fully automatically processed and invok- 
able user inclusion request for an ontology (LOV requires a thorough review by 
its community). Apart from LOV, which analyzes referenced ontologies, none 
of the systems implemented a strategy to discover and include further ontolo- 
gies or even use multi-layered approaches like Archivo. Besides OBO foundry 
and OntoMaven relying on a push-only approach, all systems use an automatic 
fetc.h (update) mechanism to serve the latest version of an ontology. Archivo is 
the only system providing Semantic Versioning and guaranteeing fully automatic 
unified versioning, whereas Bioportal and LOV try to extract unified timestamp 
versioning metadata but also partially rely on correct user input, OBO f. has a 
publishing principle for unified versioning, which is aut. verified but seems not 
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Table 2. System (feature) comparison along the dimensions coverage, r(ecentness), 
access and (q)uality. 


Dimension Coverage r Access q 
System name TY DO|IM |DI UP UV SV ID|PE OF MA | TE 
Archivo A I el e e e e o |e e ce e 
Bioportal all S |-/e|- ? |o |- |ol - |efe - 
LOV C,A,I | I o/- |o e o |- |o |e e o/o - 
OBO foundry C S --i-- o |- e |- e -/ e 
Ontobee I S |-/- |- |e l- - - |- |e e - 
Ontohub.org D I -/e! |- pail is - |o o | et/- 
OntoMaven repo | A - -eí - - - |- |e |e e o e 
Ont. Lookup Svc |I S |-/- |- e - - |e e/e - 
Dash represents no, white/black filled circle represent aA /full support; TY: 
system type - (A)rchive, (C)atalog, (I)ndex, (D)evelopment Platform; DO: ont. 


domain focus - (S)pecialized vs. (I)ndependent; IM: ont. import - fully autom- 
atized user inclusion requests/file submissions of new ontologies; DI: aut. ont. 
discovery; UP: aut. update of ont.; UV: unified ont. versioning labels; SV: aut. 
semantic versioning of ont.; ID: stable ont. (version) id (IRI); PE: persistent 
ont. version access for id; OF: access to ont. in one unified format; MA: system 
ont. metadata access - REST API/SPARQL; TE: flexible aut. testing of ont. 


consistency and conformity. 

1account/login required; “per ontology setting; Simported repos not in sync anymore; 
^reported, not accessible; "depending on used mvn repository systems; Snot working 
due to missing void file 


enforced (review revealed non-uniform versioning labels). With regard to ontol- 
ogy citation or dependency management of ontologies, Archivo and OntoMaven 
(we were not able to find any hosted ontology though) qualify by providing uni- 
fied and stable, abstract identifiers (independent of the archiving system and 
ontology serialization) for ontologies and its version, while taking extra effort to 
achieve persistent access to the ontology for these identifiers. Besides Bioportal 
all systems try to reduce the variety of ontologies by supplying every ontology 
in at least one unified format. Versioning/ontology system metadata access for 
Archivo is designed to work via RDF and SPARQL, at the time of writing there is 
only a very basic REST API (and Linked Data interface) available. Both OBO f. 
and Archivo leverage a continuous, flexible/customizable testing system which is 
coordinated and performed at a central place to report issues and improve qual- 
ity, in contrast to Ontohub and OntoMaven focussing on custom tests from/for 
publishers. 

The comparison clearly shows that Archivo addresses a gap and is, to the 
best of our knowledge, the only system which tries to tackle the (most) user 
challenges at web-scale and a consumer can rely on that the archived ontology 
retrieved by a timestamp version resolves to the one that had been served by 
the ontology authority/domain at that time (no uploader hijacking and curator 
errors possible). 
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7 Future Work 


On a conceptual level, we would like to develop Databus mods”? further in order 
to allow users to augment the archived ontologies with modular contributions 
(e.g. labels for another language, mappings, another validation report, custom 
star ratings, etc.). This could strengthen the idea of a platform economy - users 
contribute what they are in need of for other users. From a technical perspective 
we plan to implement the Memento protocol for the Databus/Archivo and offer 
ontology publishers to use Archivo as “plug and play Memento as a service” 
for their ontologies, to support adoption of Memento and to not take away URI 
ownership and traffic from the publishers. We also plan to integrate more exist- 
ing ontology repositories to increase the coverage for other domains. We aim 
to further enhance existing Databus tools, such that they improve support for 
special aspects of ontology consumption (e.g. automatic client side conversion 
of ontology formats and ontology import dependency rewriting with Databus 
client). 


Acknowledgments. This work was partially supported by grants from the Federal 
Ministry for Economic Affairs and Energy of Germany (BMWi) for the LOD-GEOSS 
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Abstract. In the field of domestic cognitive robotics, it is important to 
have a rich representation of knowledge about how household objects are 
related to each other and with respect to human actions. In this paper, 
we present a domain dependent knowledge retrieval framework for house- 
hold environments which was constructed by extracting knowledge from 
the VirtualHome dataset (http://virtual-home.org). The framework pro- 
vides knowledge about sequences of actions on how to perform human 
scaled tasks in a household environment, answers queries about house- 
hold objects, and performs semantic matching between entities from 
the web knowledge graphs DBpedia, ConceptNet, and WordNet, with 
the ones existing in our knowledge graph. We offer a set of predefined 
SPARQL templates that directly address the ontology on which our 
knowledge retrieval framework is built, and querying capabilities through 
SPARQL. We evaluated our framework via two different user evaluations. 


Keywords: Ontology - Cognitive robotics - Knowledge retrieval 
framework - Semantic similarity 


1 Introduction 


Ontologies have been used in many cognitive robotic systems which perform 
object identification [8, 22, 31], affordances detection (i.e. the functionality of an 
object) [2,16,25], and for robotic platforms that work as caretakers for people 
in a household environment [20,34]. We can see an extensive survey on these 
topics in [9]. In this paper, we introduce a novel knowledge retrieval framework! 
for household objects and actions that can be used as part of the knowledge 
representation component of a cognitive robotic system, which is connected with 


! https://github.com/valexande/HomeOntology. 
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a custom made semantic matching algorithm to enrich its knowledge. Moreover, 
to the best of our knowledge our ontology is the largest one about objects and 
actions, as well as activities (i.e. set of object-action relations). 

Common Sense (CS) knowledge is an aspect that is desired by any Artificial 
Intelligence (AI) system. Eventhough, there are no strict definitions on what we 
should consider CS knowledge. Our knowledge retrieval framework can help tackle 
queries that require CS reasoning, on how objects are related, and how we can per- 
form a human scaled task. Some example queries are “What actions can I perform 
with a pot?”, or “What other objects are related to knife, plate, and fork?”, or even 
“What can I turn on if I am in the living room?”. Furthermore, our framework can 
recommend sequences of actions on how to perform a human scaled task, like “How 
can I make a sandwich?”. Our framework is based on a domain-specific ontology 
that we have developed which contains knowledge from the VirtualHome dataset 
[17,23]. The ontology is built in OWL [19] and the Knowledge Base (KB) can be 
easily extended by adding new instances of objects, actions, and activities. 

Due to the fact that the VirtualHome dataset covers a restricted set of 
objects, in order to be able to retrieve knowledge about objects on a larger scale, 
we developed a mechanism that can take advantage of external open knowledge 
bases in order to retrieve knowledge or answer queries about objects that do 
not exist in our KB. To this end, we have devised a semantic match making 
algorithm that retrieves semantically related knowledge out of three web knowl- 
edge graphs, namely DBpedia [5], ConceptNet [18], and WordNet [30]. When our 
framework cannot find an entity in its own KB, it uses the knowledge existing in 
the aforementioned KBs, to relate the unknown entity with one in our local KB. 
Also, the framework can provide some general knowledge about objects such 
as “How much fat does a banana have?”, with predefined SPARQL query tem- 
plates addressed to DBpedia. We notice that our framework performs semantic 
matching only with the aforementioned ontologies. 

The knowledge retrieval framework was evaluated with two different user eval- 
uation methods. In the first one, 42 subjects were asked on how satisfied they were 
with the returned answers on different query categories. The results seem promis- 
ing with a 82% score. While in the other evaluation, we gathered a gold standard 
dataset for a set of queries that our framework can answer, from a group of 5 per- 
sons not part of the first group. Then, we asked a group of 34 people to give us 
answers to the same queries using only information from each dataset, and we 
compared these with the answers of our knowledge retrieval framework. 

The rest of the paper is organized as follows. In Sect. 2, we present the related 
work. In Sect. 3, we describe our approach and the architecture of our knowledge 
retrieval framework. Next, in Sect. 4 we present the results of the user evaluation. 
Finally, in Sect.5 we give a discussion and the conclusion. 


2 Related Work 


Our study balances between two fields. Firstly, our knowledge retrieval frame- 
work can be fused in a cognitive robotic system acting in a household envi- 
ronment. The cognitive robotic system will then enhance its knowledge about 
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which objects are related, object properties, affordances understanding, and to 
semantically connect entities in its KB with entities in DBpedia, ConceptNet, 
WordNet. Secondly, if one considers only the ontology part of our work then 
this ontology would be close to other Linked Open Data KBs about products, 
and household objects. For the first case, we need to mention that our study 
can stand only as part of the knowledge representation component of a cognitive 
robotic system that can fill reasoning gaps. 

Property extraction and creation methods, between objects in a household 
environment, have been implemented in many robotic platforms [8,22,33]. Usu- 
ally an object identification is done based on the shape and the dimensions 
perceived by the vision module, or in some cases [2,31] reasoning mechanisms 
such as grasping area segmentation, or a physics based module contribute to 
understand an object’s label. In [27], spatial-contextual knowledge is used to 
infer the label of an object, for example the object x is usually found near 
objects y1,---,;Yn, or x is found on y. Even though these are state of the art 
frameworks, the robotic platform has to extract information from two or more 
different ontologies, in order to link an object with an affordance. 

The aspect of affordances understanding based on an ontology, mainly with 
OWL format, is widely studied. In [16,25], authors try to understand affordances 
by observing human motion. They capture the semantics of a human movement, 
and correlate it with an action label. On the other hand, Jager et al. [13] have 
connected objects with physical and functional properties, but the functional 
properties which can be considered as affordances, capture a very abstract con- 
cept, as they define only the properties containment, support, movability, block- 
age. Similarly, Befler et al. [3] define 18 actions that can be performed on objects 
if some preconditions hold in each case, such as if the objects are reachable, the 
material of the object, among others. The affordances existing in our knowledge 
retrieval framework are more than 70, combined with other features. Thus, we 
can offer greater plurality from frameworks like the aforementioned ones. 

Our study attempts to fill the gap found in the previous studies and develop 
a knowledge retrieval framework that would complete the missing knowledge. 
Our framework, compared to the previous ones can offer: (i) a predefined KB of 
objects related to actions, (ii) a KB with sequences of actions to achieve human 
scaled tasks, and (iii) a mechanism that uses semantic match making between 
an entity that does not exist in our KB with an entity in the KB. 

Our semantic matching algorithm was mostly inspired by the works of Young 
et al. [35], and Icarte et al. [12] where they use CS knowledge from the web 
ontologies DBpedia, ConceptNet, and WordNet to find the label of unknown 
objects. As well as from the studies [6,36], where the label of the room can be 
understood through the objects that the cognitive robotic system perceived from 
its vision module. One drawback that can be noticed in these works, is that all 
of them depend on only one ontology. Young et al. compares only the DBpedia 
comment boxes between the entities, Icarte et al. acquires only the property 
values from ConceptNet of the entities, and [6,36] on the synonyms, hypernyms, 
and hyponyms of WordNet entities. 
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As for the second part, our study can be compared with an already exist- 
ing product ontology, such as the product ontologies found in [24,32], the more 
recent [28], and the general purpose ontology GoodRelations [10]. Our difference 
is that these ontologies offer information about objects, geometrical, physical, 
and material properties, and create object taxonomies and hierarchical relations. 
Instead, we have implemented knowledge about object affordances and we rep- 
resent knowledge, about objects through their affordances. Furthermore, O-Pro 
[4] is an ontology for object-affordance relations, but is considerably smaller 
with respect to the quantity of objects and affordances. Thus, to the best of our 
knowledge we offer the largest ontology about object affordances, in a household 
environment. 


3 Our Approach 


In this section, we describe in detail the architecture and the different aspects 
of our knowledge retrieval framework. In the first subsection, we describe the 
dataset from which we took knowledge and fused in our schema. Next, we present 
the ontology that is the main component of our framework. In the last subsection, 
we describe the algorithm that semantically matches entities from DBpedia, 
ConceptNet, and WordNet, with entities in our KB. 


3.1 Household Dataset 


The VirtualHome dataset [17,23] contains activities that people do at home. 
For each activity, there are different descriptions on how to perform them. The 
descriptions are present in the form of sequence of actions, i.e., steps that con- 
tain an action related with an object or objects, illustrated in Example 1. More- 
over, the dataset offers a virtual environment representation for each sequence 
of actions with Unity’. The dataset contains ~2800 sequences of actions, for 
human scaled activities. Moreover, the dataset holds more than 500 objects, 
usually found in a household environment, which are semantically connected 
with each other, and with specific human scaled actions. 


Example 1. Browse Internet 
Comment: walk to living room. look at computer. switch on computer. sit in 
chair. watch computer. switch off computer. 


Walk] (living room) (1) 
Walk] (computer) (1) 
Find] (computer) (1) 
TurnTo] (computer) (1) 
LookAt] (computer) (1) 
SwitchOn] (computer) (1) 
Find] (chair) (2) 


? https://unity.com. 
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[Sit] (chair) (2) 
[Watch] (computer) (1) 
[SwitchOff] (computer) (1) 


Each sequence of actions has a template: (a) Activity Label, (b) Comment, 
i.e. small description, and (c) the sequence of actions. Each step has the general 
form shown in (1): 


[Action]|(Objecti) (LD1) ... (Object) (1D) (1) 


where Action is the human scaled action, Objecti,..., Object, are the objects 
on which the action is performed (n € N), and 1D,...,ID, are the unique 
identity numbers between the objects that represent the same natural object. 
In our experiments we have approximately 500 objects, but due to the fact that 
the ontology can be freely extended with objects, we consider n as a natural 
number. 


3.2 Ontology 


The main component of our knowledge retrieval framework is the ontology that 
was inspired by the VirtualHome dataset. Figure 1a presents part of the ontology 
concepts, while Fig. 1b the relationships between the major concepts. 

The class Activity contains some subclasses which follow the hierarchy pro- 
vided by the dataset; these were hand-coded. Moreover, the instances of these 
classes are the sequence of actions presented in the KB of the dataset. The class 
Activity is connected through the property listOfSteps with the class Step. Addi- 
tionally, the class Step is connected through the properties object and step_type 
with the classes ObjectType and Step Type, respectively. Next, the class Object- 
Type contains the labels of all the objects found in the sequences. On the other 
hand, the class StepType is similar to ObjectType as it gives natural language 
labels to the steps. 

We have represented every sequence of actions as a list, because this gave us 
stronger coherency and interaction on the knowledge provided by the activity. 
Thus, we can answer queries like ^What is the third step in the sequence of 
activity X?”, or “Return all the sequences where firstly I walk to the living room, 
then I open the TV, and after that I sit on the sofa”, information very crucial for a 
system with planning capabilities. Also, we have developed an instance generator 
algorithm that transforms the sequences of actions from the form of Example1 
into instances of classes in our ontology. The class that the sequence belongs to, 
is provided by the Activity label. We give such an instance in Example 2. 


Example 2. 
: browse.internet132 rdf:type :Browselnternet ; 
:listOfSteps ( 
:walk1607 :walk1608 :find1609 
:turntol610 :lookatl1611 :switchon1612 
: find1613 :sitl614 :watchl615 :switchoff1616 ) ; 


rdfs:comment ‘‘walk to living room...” ; 
? 
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is-a 
HouseCleaning 


*ObjectType *StepType 
| 
is-8 
Abject Step type 


“Step | 


ObjectType | is-a 
| 
| is-a lisiofSteps 
= Activities | 
HygieneStyling 
(a) 


Fig. 1. (a) Part of Ontology Scheme (b) Ontology Properties. 


Each step shown in the property listOfSteps is an instance of the class 
Step. Each step has a unique ID that distinguishes it from all the other steps. 
Example 3 shows an instance step from the listOfSteps, and Example 4 the object 
and action with which the instance is connected from the Object Type and Step- 
Type classes. 


Example 3. 
:walk1608 rdf:type :Step ; 
:object :computerl ; 
:steptype : walk 
Example 4. 


:computerl rdf:type :ObjectType; 
rdfs:label *'computer" Gen. 


:walk rdf:type :StepType; 
rdfs:label ‘‘ walk” Gen 


After constructing and populating the ontology, we have developed a library 
in Python that constructs SPARQL queries addressed to the ontology and fetches 
answers. The library consists of 9 predefined query templates that represent the 
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most probable question types to the household ontology. These templates were 
consider as more important after an extensive literature review of studies about 
cognitive robotic systems that act in a household environment [9]. Among many 
other studies, we have considered primarily KnowRob [2,31], RoboSherlock [1], 
RoboBrain [29], and RoboCSE [7]. We managed to find what were the most 
common and crucial queries addressed to a cognitive robotic system and we con- 
structed these templates based on these findings. Example 5 shows the SPARQL 
template that returns the objects which are related to two other objects, Object1 
and Object2. 


Example 5. 
SELECT DISTINCT ? object WHERE{ 
?instance :listOfSteps ?list. 
7list rdf:rest«/rdf:first ?element. 
?element :object ?object 
SELECT DISTINCT ?instance WHERE{ 
?intl rdfs:subClassOf : Activity. 
?inl rdfs:subClassOf ?intl. 
?instance rdf:type ?inl; 
:listOfSteps ?listl. 
7listl rdf:rest*/rdf:first ?stepl. 
?stepl :object :Objectl 
?int2 rdfs:subClassOf : Activity. 
?in2 rdfs:subClassOf ?int2. 
?instance rdf:type ?in2; 
:listOfSteps ?list2. 
7list2 rdf:rest«/rdf:first ?step2. 
7step2 :object :Object2.1 


Alternatively, ad-hoc SPARQL queries can be asked to the ontology, such as 
Example 6 were an user wants to see the objects involved in the activity, activity. 


Example 6. 

SELECT DISTINCT ?object WHERE{ 
:activityl :listOfSteps ?list 
7list rdf:rest«/rdf:first ?step 
"step :object ?object } 


Therefore, users can hand pick one of the predefined queries and then give the 
keywords that are needed in order to fill the SPARQL template (Example 5), or 
they can write their own SPARQL query to access the information they desire 
(Example 6). 


3.8 Semantic Matching Algorithm 


Due to the fact that the dataset upon which the knowledge retrieval framework 
was constructed has a finite number of objects, in order to be able to retrieve 
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knowledge about objects on a larger scale, we developed a mechanism that can 
take advantage of the web knowledge graphs DBpedia, ConceptNet, and Word- 
Net to answer queries about objects that do not exist in our KB. This would 
broaden the range of queries that the framework can answer, and would overcome 
the downside of our framework being dataset oriented. Algorithm 1 was imple- 
mented using Python. The libraries Request and NLTK® offer web APIs for all 
three aforementioned ontologies. Similar methods can be found in [12,35], where 
they also exploit the CS knowledge existing in web ontologies. Algorithm 1 starts 
by getting as input any word that is part of the English language; we check this 
by obtaining the WordNet entity, line 3. The input is given by the user implic- 
itly, when he gives a keyword in a query that does not exist in the KB of the 
framework. 

Subsequently, we turn to ConceptNet, and we collect the properties and 
values for the input word, line 4. In our framework, we collect only the values 
of some properties such as RelatedTo, UsedFor, AtLocation, and IsA. We choose 
these properties because they are the most related to our target application 
of providing information for household objects. Also, we acquire the weights 
that ConceptNet offers for each triplet. These weights represent how strong the 
connection is between two different entities with respect to a property in the 
ConceptNet graph, and are defined by the ConceptNet community. Therefore, 
we end up with a hash map of the following form: 


{ Property, : [(entity], weight; ) pa (entity), weight?,)] iubens 


Property, : [(entityl , weight!) pears (entity},, weightt,) | ) 


for m,l, k € N\{O}. 

Then, we start extracting semantic similarity between the given entity and 
the returned property values using WordNet and DBpedia, lines 5-8. Firstly, we 
find the least common path that the given entity has with each returned value 
from ConceptNet, in WordNet, line 9. The knowledge in WordNet is in the form 
of a direct acyclic graph with hyponyms and hypernyms. Thus, in each case 
we obtain the number of steps that are needed to traverse from one entity to 
another. Subsequently, we turn to DBpedia to extract comment boxes of each 
entity using SPARQL, lines 11-13. If DBpedia does not return any results, we 
search the entity in Wikipedia, which has a better search engine, and with the 
returned URL we ask again DBpedia for the comment box, based on the mapping 
scheme between Wikipedia URLs and DBpedia URIs, lines 14-20. Notice that 
when we encounter a redirection list we acquire the first URL of the list which 
in most cases is the desired entity, and acquire the comment box. 

'The comment box of the input entity is compared with each comment box of 
the returned entities from ConceptNet, using the TF-IDF algorithm to extract 
semantic similarity, line 21. Here we follow a policy which prescribes that the 
descriptions of two objects which are semantically related will contain common 


3 https:/ /www.nltk.org. 
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words. We preferred TF-IDF despite its limitations, as it may miss some words 
only from the difference of one letter, because we did not want to raise the 
complexity of the framework using pre-trained embedding vectors like Glove 
[21], Word2Vec [26], or FastText [14], this remains as future work. 


Algorithm 1: Semantic Matching Algorithm 


1 Input: entity 
2 Output: hash semantic.similarity 
3 if entity in WordNet then 
4 hash. property.values = get.ConceptNet. property values(entity) 
5 Comment. Box Input = get.DBpediaCommentBox(entity) 
6 hash. semantic.similarity = {} 
7 for property in hash property. values do 
8 for value in hash_property_values[property] do 
9 WordNet. Path = get. WordNetPath(entity,value) 
10 Commnet. Box — () 
11 if value in DBpedia then 
12 | Comment_Box = get.DBpediaCommentBox(value) 
13 end 
14 if Comment Box = ( then 
15 wiki entity = get. WikipediaEntityURL(value) 
16 Comment. Box = get.DBpediaCommentBox(wiki entity) 
iT end 
18 if Comment Box = ( then 
19 | continue 
20 end 
21 weight TFIDF = TF-IDF(Commen. Box Input,Comment. Box) 
hash.semantic.similarity[property| = (value, Similarity (entity, 
value)) 
22 end 
23 end 
24 hash. semantic.similarity = sorted(hash semantic. similarity) 
25 end 


In order to define the semantic similarity between the entities, we have 
devised a new metric that is based on the combination of WordNet paths, TF- 
IDF scores, and ConceptNet weights Eq. (2). We choose this specific metric 
because it takes into consideration the smallest WordNet path, the ConceptNet 
weights, and the TF-IDF scores. TF-IDF and ConceptNet scores have a posi- 
tive contribution to the semantic similarity of two words. On the other hand, 
the bigger the path is between two words in WordNet the smaller the semantic 
similarity is. 


1 

Sim(i, v) CWNP,v) + TFIDF(i,v) -- CNW(i,p,v) (2) 
In Eq.2, i is the entity given as input by the user, and v is each one of the 
different values returned from ConceptNet properties. CNW (i, p, v) is the weight 
that ConceptNet gives for the triplet (i, p, v), and p stands for the property that 
connects i and v. TFIDF(i,v) is the score returned by the TF-IDF algorithm 
when comparing the DBpedia comment boxes of i and v. WNP(i,v) is a two 
parameter function that returns the least common path between 2 and v, in the 

WordNet direct acyclic graph. 
In case i and v have at least one common hypernym (ch), then we acquire the 
smallest path for the two words, whereas in case i and v, do not have a common 


Household Ontology 45 


hypernym (nch), we add their depths. Let depth(-) be the function that returns 
the number of steps needed to reach from the root of WordNet to a given entity, 
then: 


. mincec {depth(i) + depth(v) — 2 x depth(c)) ch 
Har { pee + T 2 nch (3) 
where C is the set of common hypernyms for i and v. WNP (.,-) will never be 
zero, as two different entities in a direct acyclic graph will always have at least 
one step path between them. 

'The last step of the algorithm sorts the semantic similarity results of the 
entities with respect to the ConceptNet property, and stores the new information 
into a hash map, line 24. An example of the returned information is given in 
Example 7 where the Top-5 entities for each property are displayed, if there exist 
as many. 


Example 7. coffee IsA: stimulant, beverage, acquired taste, liquid. 
coffee AtLocation: sugar, mug, office, cafe. 
coffee Related To: cappuccino, iced. coffee, irish. coffee, turkish. coffee, 
plant. 
coffee UsedFor: refill. 


4 Evaluation 


We evaluated our knowledge retrieval framework via two different user evalu- 
ations. Firstly, by asking people how much they are satisfied with the results 
returned. Basically, we wanted to see if the answers returned by our framework 
satisfied the users in terms of CS. Due to the fact that we cannot define strict 
rules on what can be considered as CS, each subject gives their personal opin- 
ion to evaluate how satisfied they are with each answer. Thus, we asked for a 
score from 1 to 5 to eight categories of queries. Each person had to evaluate 40 
answers (5 queries of each of the eight categories). Subjects were presented with 
the Top-5 answers returned for each query. We tried to find people both related to 
Computer Sciences (CSc) and people not related to Computer Science (N-CSc), 
resulting in 19 and 23 subjects, respectively. We also made another clustering 
with the same people based on their education level, Workers 13 (W) that did 
not go to University, Bachelor/Master Students 23 (B/M), PhD Students 6 (P). 

'The categories of queries that were evaluated are the following: Q1: *On what 
objects can I perform the actions X1,.., Xn if I am in room Y?”, Q2: “On what 
objects can I perform the actions X1,..,Xn?", Q3: “What can I do with objects 
O1,...,Om?", Q4: “What objects are related to objects O1,...,0m?”, Q5: “Give me 
the category of activities for X^, Q6: “Give me related objects to O1,..., Om", QT: 
“Give me similar action(-s) to A", and Q8: “Recommend an Activity based on 
the description A”. Notice that in Q4, we addressed queries with objects that do 
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not exist in our KB, to see how satisfied people are with the recommendations 
from Algorithm 1. Table 1 and Table2 present the Mean and Variance scores, 
respectively. The results are rounded to two decimals in all the tables. 


Table 1. Table with Mean scores for Q1-Q8. 


General W |B/M P .CSc|N-CSc 
Ql 4.20 4.18 | 4.17 | 4.22 | 4.21 | 4.29 
Q2 4.35 4.36 | 4.39 | 4.32 | 4.39 | 4.35 
Q3 4.08 4.08 | 4.16 | 4.06 | 4.10 | 4.08 
Q4 3.73 3.72 | 3.70 | 3.74 | 3.72 | 3.73 
Q5 4.19 4.16 | 4.24 | 4.16 | 4.16 | 4.18 
Q6 4.11 4.09|4.12 | 4.10 | 4.11 | 4.09 
QT 3.99 4.09 | 3.97 | 3.95 | 3.91 | 4.06 
Q8 4.12 4.10| 4.16 | 4.25 | 4.02 | 4.09 
Mean 4.10 4.1 |4.11 4.10 | 4.08 | 4.10 


Table 2. Table with Variance scores for Q1-Q8. 


General W B/M P  CSc|N-CSc 
Q1 0.96 1.52 0.95 |0.78 1.20 | 0.74 
Q2 0.80 0.83 0.76 |0.92 | 0.71 | 0.87 
Q3 1.12 0.5 |1.44 |0.91 | 1.25 | 1.06 
QA 1.52 1.06 | 1.52 | 1.69 1.51 | 1.51 
Q5 1.61 1.54 1.59 | 1.65 1.73 | 1.49 
Q6 0.98 0.86 | 0.95 | 1.09 | 0.97 | 0.98 
Q7 1.75 1.75 1.82 | 1.38 1.88 | 1.66 
Q8 1.56 1.46 1.54 | 1.74 1.64 | 1.52 
Variance | 1.20 1.13 1.21 | 1.21 1.26 1.13 


As we can see, we obtained an overall of 4.10/5, which translates to an 
8296 score. Moreover, regarding the low score of Q4 in comparison to other 
queries we can comment the following. This happened because we had a very 
high threshold value to the Ratcliff-Obershelp string similarity metric, which 
compared the returned results from Algorithm 1 with the ones in our KB. On 
top of that, we did not display the recommendation from Algorithm 1; instead, 
we displayed the entity from our KB with which the result of Algorithm 1 was 
close enough. The threshold was 0.8 and we reduced it to 0.6; for smaller values 
the recommendations of Algorithm 1 in most cases were not related to our target 
application. Therefore, we reduced the value of the threshold and displayed the 
web KB recommendation. We performed these changes in order to affect only 
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Q4. The new results are displayed in Table 3. We observe that the Mean score 
for Q4 increased by 13.5%, and the Variance shows that the scoring values came 
closer to the Mean value by 0.89. 


Table 3. Table with Mean and Variance scores for Q4, with the new changes. 


p 


Mean | 4.41 4.35 | 4.36 46 4.45 | 4.36 
Variance 0.61 0.74 0.65 (0.39 0.42 0.73 


In our second evaluation, we asked from 5 subjects not part of the first 
group to give us their own answers in the queries Q1-Q7, apart from Q4 (we 
shall denote this by Q1-Q7\Q4). We omitted Q4 and Q8 because we consider 
them as less important for evaluating the capabilities of our knowledge retrieval 
framework. More specifically, from the viewpoint of a user Q4 is similar to Q6, so 
there was no point asking it again. On the other hand, for the Q8 the 5 subjects 
were reluctant to answer it because they considered it very time consuming (it 
required to provide 25 full sentences; not just words as in the case of the other 
queries), so we could not gather a quantitatively appropriate dataset. Therefore, 
the 5 subjects had to give us 5 answers based only on their own opinion for 5 
queries from each one of Q1-Q7\Q4. We resulted with a baseline dataset of 125 
answers for each query. Next, 34 subjects from the first evaluation agreed to 
proceed with the second round of evaluation. Each one had to give one answer, 
for 5 queries from each one of the queries Q1-Q7\Q4 (5 * 6 2 30 answers in total) 
picked from the aforementioned dataset. 


Number of correct answers in first i choices 


(4) 


T. j = 
ES Number of answers in users category j 

where i = 1,3,5, and j € (W, B/M, P, CSc, N-CSc}. Then, we compared these 
answers with what our knowledge retrieval framework returned to each query in 
the first choice (Top1), the three first choices (Top3), and in the five first choices 
(Top5). The results are in Table 4, and they show the precision of the system 


Eq. (4). 


Table 4. Table with Top1—Top3—Top5d scores. 


General | W B/M |P CSc | N-CSc 
Topl|T71.196 | 71.1% | 72.9% 63.3 | 71.496 | 71.0% 
Top3 | 80.7% | 71.1% | 82.2% | 75.8% | 81.696 | 80.1% 
Top5 | 84.1% | 82.7% | 89.6% | 83.396 | 83.8% | 84.3% 


We see that we achieved a 71.1% score in the Topl1 results returned by our 
knowledge retrieval framework, which is high if we take into consideration that 
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this is not a data driven framework which could learn the connections between 
the queries and answers, nor use embeddings between queries and answers that 
could point to the correct answer, therefore we gave a margin of error. Hence, we 
also display the Top3 and Top5 choices, where we see significant improvement 
by 9.6% and 13%, respectively. 

Evaluation Discussion: The evaluation unfortunately could not be done 
with immediate interaction with the framework, as we have not yet developed 
a Web API. For the first evaluation, the subjects were given spreadsheets with 
the queries and their answers and they had to evaluate each one of them. As for 
the second part, 5 subjects not part of the first group where given the queries 
Q1-Q7\Q4, and they had to give their own answer, from where we collected the 
gold standard dataset. This procedure was done again through spreadsheets. 
Subsequently, 34 subjects from the first evaluation were asked to answer Q1- 
Q7\Q4 using as options the words from the gold standard dataset. Therefore, 
the latter group were given the stack of potential answers for each query and a 
spreadsheet with the queries Q1-Q7\Q4. 

Considering to potential biases we notice that between the first and second 
evaluation there was a time lapse of over 40d, so we doubt that any of the sub- 
jects remembered any answer from the first evaluation. Secondly, the queries were 
formed after an extensive literature review of what is commonly considered as cru- 
cial knowledge for cognitive robotic systems interacting with humans in a house- 
hold environment. Furthermore, although we have 9 predefined SPARQL tem- 
plates we have used only 8 of them in the first evaluation; this is because the one 
that was omitted involves the activities that were part of the VirtualHome dataset, 
so we have considered that this was already evaluated by previous related work. 

Finally, looking at the results of the evaluation we drive the following con- 
clusions. Firstly, the large percentage (8296) on how much satisfied with the 
answers of our knowledge retrieval framework the subjects are, signifies that our 
framework can be used by any cognitive robotic system acting in a household 
environment as a primary (or secondary) source of knowledge. Secondly, the sec- 
ond method of evaluation implies that our knowledge retrieval framework could 
be used as a baseline for evaluating other cognitive robotic systems acting in 
a household environment. Thirdly, the scores that Algorithm 1 achieved, show 
that it can be used as an individual service for semantically matching entities of 
a knowledge graph with entities from ConceptNet, DBpedia, and WordNet as it 
can be easily extended with more properties. 


5 Discussion and Conclusion 


In this paper, we presented a knowledge retrieval framework that can be fused in 
a cognitive robotic system that acts in a household environment, and an ontol- 
ogy schema. More specifically, we extracted information from the VirtualHome 
dataset to fuse it into our framework. Furthermore, with an instance genera- 
tor algorithm we translated the activities as instances of the ontology classes. 
Therefore, we obtained knowledge, about how actions and objects are related, 


Household Ontology 49 


what objects are related with each other, what objects and actions exist in an 
activity, and suggestions on how to perform an activity in a household envi- 
ronment, through a set of predefined SPARQL query templates. The knowledge 
retrieval framework can also address hand-coded SPARQL queries to its own 
KB. Additionally, we broadened the range of queries the framework can answer, 
by developing a Semantic Matching Algorithm that finds semantic similarity, 
between entities existing in our KB and entities from the knowledge graphs of 
DBpedia, ConceptNet, and WordNet. 

'The problem of building an ontology schema that contains a wide variety 
of instances and properties, is well studied [11,15]. The same does not hold 
when we try to fuse CS knowledge in an KB, therefore usually methods that 
acquire CS either from a local KB, or a combination of local and web KBs are 
used. Unfortunately, fusing CS knowledge and reasoning in an ontology is not 
a very well-studied area, and the methods presented until now can rarely be 
generalized. CS knowledge and the capability of a cognitive robotic system to 
answer CS related queries offers flexibility. 

We consider that we made a contribution in this direction by presenting a 
knowledge retrieval framework that can provide knowledge to a cognitive robotic 
system to answer questions that require CS reasoning. Looking at the results of 
our two evaluations we can conclude that our approach has a merit towards 
our aims. Firstly, the 8296 score in the first evaluation where the users had to 
evaluate the answers based on their own CS, implies that our framework can 
provide knowledge for CS questions in a household environment. Additionally, 
the scores in the second evaluation show that the knowledge retrieval framework 
can be used as a baseline for evaluating other frameworks. 

As for future work, we are planning to extend the scheme of the ontology 
with spatial information about objects, for example soap is usually found near 
sink, sponge, bathtub, shower, shampoo. Also, we plan to broaden the part of the 
framework which returns general knowledge about objects, by extracting knowl- 
edge from more open web knowledge graphs, in addition to DBpedia. Finally, 
we aim to extend the Semantic Matching Algorithm by obtaining information 
from other ontologies. 
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Abstract. Semantic technologies offer significant potential for improv- 
ing data search applications. Ongoing work thrives to equip data cata- 
logs with new semantic search features to supplement existing keyword 
search and browsing capabilities. In particular within the social sciences, 
searching and reusing data is essential to foster efficient research. In 
this paper, we introduce an approach and experimental results aimed at 
improving interoperability and findability of social sciences survey items. 
Our contributions include a conceptual model for semantically repre- 
senting survey items and questions, detailing meaningful dimensions of 
items, as well as experimental results geared towards the automated 
prediction of such item features using state-of-the-art machine learning 
models. Dimensions of interest include, for instance, references to geolo- 
cation and time periods or the scope and style of particular questions. We 
define classification tasks using neural and traditional machine learning 
models combined with sentence structure features. Applications of our 
work include semantic and faceted search for questions as part of our 
GESIS Search. We also provide the lifted data as a knowledge graph via 
a SPARQL endpoint for further reuse and sharing. 


Keywords: Question feature extraction - Social sciences survey data - 
Semantic data modelling - Natural language processing 


1 Introduction 


In the social sciences, questionnaire-based survey programs are the instrument 
of choice to collect information from a particular population. This survey data 
usually comprises attitudes, behaviours and factual information. To collect sur- 
vey data, a research team usually composes a dedicated questionnaire for a 
population group and collects the data in personal interviews, telephone inter- 
views, or online surveys. As this process is very complex and time-consuming, 
social scientists have a strong need for re-using both actual survey results for 
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secondary analysis [8] as well as well-designed and constructed survey items, 
e.g. specific questions. In Germany, GESIS - Leibniz Institute for the Social 
Sciences! is a major data provider that gathers, archives and provides survey 
data to researchers from all over the world. Datasets are searchable through 
GESIS Search? or gesisDataSearch?. Current research on social scientists’ infor- 
mation needs indicates an increasing need for re-using survey data [17] and 
ongoing work already focuses on improving search applications with semantics 
e.g. from the users’ perspective [12]. 

A crucial factor in the process of finding and identifying relevant survey data 
is the quality of available metadata. Metadata includes general information like 
title, date of collection, primary investigators, or sample, but also more specific 
information about the study’s content like an abstract, topic classifications and 
keywords. So far, these metadata help to find a study of interest but they are less 
helpful if a researcher is interested in finding specific questions or variables. While 
a question is the text that is used to collect answers, variables contain the expres- 
sion of the answers’ characteristics. For example, the fictitious question “What 
is your attitude towards the European Union?” has the variable *AttiduteEU" 
which could have the characteristics (1) negative, (2) neutral, or (3) positive. Cur- 
rently, no dedicated vocabularies are used for capturing the semantics of variables 
or items, for instance, their scope, nature or georeferences (EU in this case). 

A common way for researchers to find variables and questions is to first find 
suitable datasets. In a second step, they read exhaustive documentation to find 
concrete questions or variables that fit their research question. For comparing 
variables and finding similar variables, this process has to be repeated. Recently 
introduced variable search systems* address this issue by providing a way to 
search for questions and variables with a common text-based search approach. 
However, the intention of a question or the concept to be measured are often 
not directly verbalized in its textual description. 

In this paper, we examine how a variable's content can be described more 
expressively, using state-of-the-art semantic technologies. Therefore, we focus 
on extracting and representing additional information from a question that go 
beyond tagging the questions with keywords and topics. Our approach is based 
on the ofness and aboutness concept of survey data introduced by [10]. While 
ofness refers to the literal question wording, which often reveals information 
about the topic of a question, the aboutness relates to the latent content. In our 
work, we focus on the aboutness aspects. This includes, for instance, the scope 
and nature of a question, e.g. whether the question asks for opinions or about a 
fact about the interviewee's life. 

'The so called question features are designed to complement each other and 
are formally modeled as RDF(S) data model. We also introduce experiments for 


1 https:/ /www.gesis.org. 

? https:/ /search.gesis.org/. 

3 https://datasearch.gesis.org/start. 

t like ICPSR https://www.icpsr.umich.edu [25], GESIS Search https:/ /search.gesis. 
org [14], paneldata.org https:/ /paneldata.org/. 
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supervised classification models able to automatically predict question features. 
As the focus of our work is more on the question features and their systematic 
we started with established classifiers leaving more recent approaches for future 
work. Experiments are conducted on a real-world corpus of frequently used sur- 
vey questions, consisting of 6500 distinct questions. For each question, we extract 
the question features by using a variety of text classification approaches, e.g., 
neural networks like LSTM. In addition, we generated a knowledge graph (KG) 
and publish the results via a dedicated SPARQL endpoint’. 

Our main contributions can be summarized as followed: (1) We provide a 
taxonomy of question features and (2) a comprehensive data model describing 
the questions and the features in relation to each other. Finally, (3) we pro- 
vide methods and first results for the prediction of one question feature, i.e. for 
populating a knowledge base of expressive question metadata. 

The paper is structured as follows. First, we provide the related work in 
Sect.2 and elaborate the design of the question features and the data model 
in Sect.3. Afterwards, in Sect.4, we describe our experiments on extracting 
the “Information type” question feature before we eventually close discussing 
application scenarios and draw a conclusion (Sect. 5). 


2 Related Work 


In this section, we discuss related work, including available survey data catalog 
systems, relevant RDF vocabularies for model design and methods for feature 
extraction. 

Some notable providers of social science survey data in Germany and inter- 
nationally are GESIS, LifBi(NEPS)°, SOEP/DIW’, pairfam® and ICPSR. These 
institutions allow their customers access to data and documentation on differ- 
ent levels. Smaller institutions are known for a narrow set of datasets, they do 
not host complex online catalogs but provide study documentation as HTML 
or PDFs online. However, sometimes they cooperate with larger institutions or 
consortiums that host their datasets. SOEP and pairfam, for example, take part 
in panaldata.org a data catalog for variables, questions, concepts, publications 
and topics. It provides text based search. Larger institutions like GESIS and 
ICPSR host large catalogs for study level data and sub-studylevel data. GESIS’ 
GESIS Search and ICPSR’s data portal are two examples for more complete 
search applications. Yet, to our knowledge there is no example of a variable cat- 
alog system that uses expressive and formally represented question features like 
the ones presented in this paper. 

For our data model, we investigated related RDF vocabularies. [13] out- 
lines best practices to consider when publishing data as Linked Open Data by 
e.g. reusing established vocabularies. Relevant work is found in vocabularies 


5 http://data.gesis.org/questionfeaturessample/site. 
8 https: //www.neps-data.de/Mainpage. 

T https: //www.diw.de/en/soep. 

8 https: //www.pairfam.de/en/. 
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describing scientific data e.g. the DDI RDF discovery vocabulary? [2,3]. It is 
based on the Data Documentation Initiative (DDI) metadata standard, which 
is an acknowledged standard to describe survey data in the social sciences. Dat- 
aCube!? focuses on statistical data. Large cross-domain vocabularies of relevance 
include Schema.org!! and DBpedia!?. Further candidates are upper-level vocab- 
ularies like DOLCE-Lite-Plus!*, as they serve more general terms and are not 
focused at specific domains. 

With respect to methodological work on classification of short text, e.g. for 
predicting question features, approaches include the ones surveyed by [1], where 
the authors provide a survey on text classification examples for different tasks 
like “News filtering and Organization”, “Document Organization and Retrieval”, 
“Opinion Mining” or “Email Classification and Spam Filtering” applying vari- 
ous approaches e.g. “Decision Trees”, “Pattern (Rule)-based Classifiers”, “SVM 
Classifiers” and many more. The authors elaborate also on the experimental 
setups and best practices. Similar work can also be found in [20]. The survey 
presented in [24] elaborates on the special aspects of short texts and popular 
work on classifiers using semantic analysis, ensemble short text classification 
etc. is introduced. In [5] the authors present an approach specialized for short 
text classification leveraging external knowledge and deep neural networks. A 
famous short text corpus and target of many classification/extraction tasks is 
Twitter!^. Our work relates for example to the extraction of specific dimensions 
e.g. sentiments [21] or events [27]. While individual approaches certainly overlap 
with ours, as they work on (rather arbitrary) short texts, our setup leverages 
specifics of survey questions which allows to compose our question features in a 
systematic way so that they complement each others and serve a common goal, 
i.e. better performance in a search system. 


3 Semantic Features of Survey Questions 


Before we introduce our taxonomy of question features, we give a closer descrip- 
tion of survey questions. 


3.1 Survey Questions 


A question in a questionnaire is described through a question text and predefined 
answer categories!?. Figure1 depicts three example questions. In some cases, 
when a group of questions differs in only the object they refer to, questionnaire 
designers assemble these in item batteries, where the items share question text 


? http:/ /rdf-vocabulary.ddialliance.org/discovery.html. 
10 https:/ /www.w3.org/TR/vocab-data-cube/. 
11 http:/ /schema.org. 
12 bttp:/ /dbpedia.org/ontology/. 
13 bttps:/ /www.w3.0rg/2001/sw/BestPractices/WNET /DLP3941. daml.html. 
14 pttps:/ /twitter.com. 
15 We do not consider open questions at the moment, as there are none in the data. 
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and answer categories. An example can be seen in Fig. 1 (question in the center). 
A variable corresponds to either a complete question when there is only one 
answer available, or a question item. In the remainder of the paper and in our 
dataset, we treat questions having several items as separate instances and refer 
to them as “question-item pairs”. Questions without items are likewise a single 
question instance. 


How is the head of state Do you think that the introduction of the new technologies in Canada 
over the next few years will make work ... 

Not applicable; head of 
i state not indirectly 

elected How frequently would you say you buy each of the following types of 3 
eic a ee a A 
Often | From time to | Rarely | Never mu 
demp TETE 
[s|wssmg | ENH 
Loose products (e.g. from 1 2 3 5 
e | | ^ Pt Et || 


[Pre-packed products | 1 | 2 —| 3 | 4 [s5] 


Fig.1. Example questions. CSES 2015 (left), ISSP 1997 (center) and Eurobarometer 
2018 (right) [9, 16, 26] 


Survey questions are not necessarily questions in the grammatical sense, i.e. 
a single sentence with a question mark at the end. Many questions incorporate 
introductory texts and definitions for clarification. Additionally, they are often 
formulated as requests for the respondent or they are prompts for supplement. 
Meaning they are formulated as the first part of a statement, stopping with 
*..."^ and leaving the second part to the respondent to complete. The question 
instances in our dataset have between one and 171 words with 29 words on 
average. 

Other properties documenting variables are an identifier, a label, interviewer 
instructions, keywords, topic classification, encoding in the dataset and more. 


3.2 A Taxonomy of Question Features 


We assume that a search session for a question starts with a topic or keyword 
search and is subsequently refined through the use of facets. Our taxonomy 
presented in the following focuses on the facets. Therefor it does not include 
features regarding the actual topic which can be extracted by e.g. topic mod- 
elling. For our semantic description, we identified recurrent patterns in survey 
questions through literature [23], elaboration with domain experts as well as 
brainstorming. We looked into more than 500 questions and question-item pairs 
from over 200 studies. From this we compiled an initial list of potential ques- 
tion features to be discussed individually with two experts who we trust. Our 
foremost interest was to identify relevant filter criteria for social scientists. Sub- 
sequently, we oriented us along the requirements needed for use cases such as 
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faceted search of items, questions or variables and identified some criteria any 
feature should adhere to. These include explicitness, distinctiveness, compre- 
hensibility, a discrete value range (which may be described through a controlled 
vocabulary), meaningfulness, recurrence in our dataset, annotatability (practi- 
cal'®) and extractability. 

We came up with a list of 11 question features involving features that describe 
the problem/task given to the respondent, e.g. the scene depicted, statements 
that can be made about the information asked, the tone and complexity of 
language or the nature of the object of the question. 

Our features are presented in the following. The list names the question fea- 
ture and provides a definition and the value range. For instance the question 
feature Time reference captures whether a question refers to the past, present 
or future of the respondents life, or whether a hypothetical scenario is depicted. 
Depending on the situation more than one value could be correct. I. e. the Infor- 
mation type was designed to be mutual exclusive. All question features are either 
of *- or 0..1- cardinality. The values are to be determined through individual 
approaches, e.g. a text classification or keyword matching, for example the value 
range for the question feature Geographic location is meant to correspond with 
the Geonames!” gazetteer. For reasons of conciseness, we omitted the definitions 
of the allowed values in the list. They are however presented online along with 
the KG documentation. 


Information type. The information type of a question characterizes which type 
of information the respondent is asked to state about the question object. 
Values: Evaluation (Sub-values: Willingness, Preference, Acceptance, Predic- 
tion, Assessment, Explanation), Fact (Sub-values: Demography, Participa- 
tion, Activity, Decision, Use, Interaction, Behaviour, Life Events), Cognition 
(Sub-values: Emotion, Knowledge, Perception, Interest, Motivation, Believes, 
Understanding). 

Focus. This feature characterizes the focus of the question object. Whether it is 
focused towards the respondent, another person or if it is wide as in a general 
question. Values: Self focus, External focus (Sub-values: Family/Member of 
family, Acquaintance, Affiliate, Public Person, Institution, Object focus/item 
focus, Event focus), Generic/universal focus and Self+external focus. 

Time reference. Time reference characterizes the question’s time reference wrt. 
past, present and future. Values: Past, Present, Future, Hypothetical - past, 
Hypothetical - present, Hypothetical - future. 

Periodicity. Periodicity characterizes the duration and periodicity of the time 
the question refers to. Values: Point in time, Time span, Periodic point in 
time, Unspecific. 


16 A human annotator needs to be able to work out the correct annotation with rea- 
sonable effort. E.g., long or nested value ranges, rare and too specific or too similar 
terms need to be avoided. 

17 https:/ /www.geonames.org/. 
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Information intimacy. Information intimacy characterizes the sensitivity of 
the requested information with respect to personal life. Values: Private, 
Public. 

Relative location. The relative location states a location that is mentioned 
which is not described by a geographic name but by its meaning for the 
respondent. Values: Without, Apartment /Flat, Neighborhood/Street, Munic- 
ipality/City, Region, Country, Continent, World, Place of work, Journey, 
Stays abroad. 

Geographic location. The name of a geographic location if mentioned. Val- 
ues: <Continent>, <Countries>, <Region>, <Government region>, Others, 
Without, Unspecific, Mixed/Multiple 

Knowledge specificity. Describes the specificity of the knowledge that is 
required to answer the question according to the origin of that knowledge. 
Values: School, Daily life, Special knowledge. 

Quantification. This feature captures the quantification of the answer. As 
opposed to Information type it is more concrete and close to physical 
quantity. Values: Frequency, Date time, Time dimension, Spatial expansion, 
Mass, Amount, Level of agreement, Boolean, Rating, Naming/Denomination, 
Order, Comparative. 

Language tone. Language tone characterizes the degree of formality or tone 
that is applied in the question. Values: Colloquial language, Formal language, 
Jargon/technical language. 

Language complexity. Language complexity characterizes the complexity of 
phrasing applied in the question. Values: Simple language, Moderate language 
level, Raised language level. 


3.8 Data Model and Vocabulary 


Our model connects to the DDI-RDF Discovery vocabulary (DISCO) [2,3]. It is 
an RDF representation of the Data Documentation Initiative (DDI) data model, 
an established standard for study metadata, maintained by the DDI Alliance!®. 
While in DISCO the focus is set on a formal documentation of a questionnaire 
and its questions, our model extends the survey questions by a conceptual rep- 
resentation with the content dimensions (question features) described in the list 
above. We arranged the question features in groups for a better overview and to 
be able to link and reuse related and similar question characteristics in the future. 
When designing the model, we tried to identify terms in established vocabular- 
ies like those mentioned in related work in order to follow best practices and 
facilitate reuse and interpretation of the data. Since the scope of our model is 
specialised towards the social sciences, reflecting very particular dimensions and 
features, for a large number of classes and properties in our model no adequate 
terms could be found in existing vocabularies. In Fig. 2 we present the designed 
model on a conceptual level. 


18 https:/ /ddialliance.org/. 
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Fig. 2. Data model. The arrows represent the question features and groups, yellow 
boxes indicate the value range for a question feature, orange boxes indicates a group. 
The gray and the white box help to connect to the context. (Color figure online) 


Our dataset is available online!? along with a SPARQL endpoint and webpage 
describing the data and providing example queries. 


4 Annotation and Enrichment 


In total, there are 165 184 machine readable and sufficiently documented vari- 
ables (i.e. questions or question-item pairs) available. The 101 554 variables hav- 
ing an English question text are included in our data set. To create a gold 
standard, we drew uniformly at random 6500 variables for manual annotation 
from this dataset. GESIS Search?? provides access to all studies and their doc- 
umentations involved in our work. 


4.1 Manual Annotation 


In a first step, we decided to focus on the feature Information type. We recruited 
an annotator based on annotation experience and knowledge about social sci- 
ence terminology to annotate this feature type. Before the annotation, the label 


19 http://data.gesis.org/questionfeaturessample/site. 
20 https: //search.gesis.org, category research data. 
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Fig. 3. Distribution of Information type L2 labels, original (left) and merged (right) 


categories were explained to the annotator. In a training phase with 100 ques- 
tion instances (excluded from the final data set), annotations that the annotator 
perceived as difficult were discussed with the authors. 

'The custom web interface guided the annotation process by displaying the 
question text, item text (if available), and the answer options. The annotator 
selected exactly two labels for each question, one label for Information type L1 
and one label for L2. Once the annotator selected a label for L1, the correspond- 
ing sub-values (L2) are presented to reduce cognitive load and avoid mistakes. 
For each question instance, the annotator reported her level of confidence on a 
scale of 0 (“not confident at all") to 10 (“very confident"). In total, 511 ques- 
tion instances were omitted due to an annotator-certainty of under 4. The final 
annotated dataset, therefore, consists of 5989 question instances. 

At the end of the process, the annotator annotated 1200 question instances 
a second time to calculate the test-retest reliability. Cohen's kappa coefficient 
reaches a substantial self-agreement of .72 for L1 and .64 for L2, a sufficient level 
of reliability to trust the consistency of the annotator. 


4.2 Automatic Prediction 


Based on the provided annotations for the Information type, we can extract this 
question feature automatically from the natural language text of the question 
and the item text, if applicable. In our case, predicting the question features 
described in Sect.3 represents a multi-class classification task. We tested and 
compared multiple classifiers on this task each for L1 and L2: LSTM [11], Ran- 
domForest [4], Multinomial Naive Bayes [18], Linear Support Vector Machines 
[7] and Logistic Regression [15]. We also took different kinds of input features 
into account: Word sequences and text structure. 

The annotated values for L1 are distributed as follows: 42.08% Evaluation, 
33.30% Fact and 24.62% Cognition. For L2, we provide the original distribu- 
tion in Fig.3 (left). The y-axis shows the percentages of relative occurrence. 
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While the classes of Information type L1 are approximately balanced, the classes 
of Information type L2 are strongly imbalanced. We assume by experience 
that the amount of data points in the smaller classes of L2 (e.g. “Believes” 
with 15 instances, or “Decision” with 39 instances) is too low to train a clas- 
sifier and therefore combine classes with insufficient instances into umbrella 
classes as shown in Fig.3 (right). For each class in L1, there is an umbrella 
class in L2: “Fact_Rest” (combining “Participation”, “Activity”, “Decision” and 
“Life Events”), “Cognition_Rest” (combining “Emotion”, “Knowledge”, “Inter- 
est”, “Motivation”, “Believes” and “Understanding”) and “Evaluation_Rest” 
(combining “Willingness”, “Acceptance”, “Prediction” and “Explanation” ). In 
the final set of classes for L2 there are nine labels: “Assessment”, “Use”, 
“Perception”, *Cognition Rest", “Preference”, “Evaluation_Rest”, “Fact_Rest” , 
“Demography”, “Behaviour”, with the biggest class (“Assessment”) having 1523 
instances, and the smallest (“Behaviour”) containing 221 samples. The umbrella 
classes of L2 are currently not part of the data model (cf. Sect. 3.3) as the respec- 
tive L1 class can be used instead e.g. “Cognition” for “Cognition Rest". 


Using Word Sequences. As natural language can be understood as a sequence 
of words, modern sequence models are a good fit to classify natural languages. 
Long-Short Term Memory (LSTM) models have shown to outperform other 
sequential neural network architectures [11] when applied to context-free lan- 
guages such as natural language. We therefore employ an LSTM architecture 
to classify the natural language questions in our data set and will subsequently 
refer to this approach as seq_Istm. 

We implemented the LSTM network using Keras’ [6] sequential model in 
Python 3.6. The model has a three layer architecture, with an embeddings input 
layer (embeddings with dimension 100), an LSTM layer (100 nodes, dropout and 
recurrent dropout at 0.2), and a dense output layer with softmax activation. The 
model is trained with categorical cross-entropy loss and optimised on accuracy 
(equals micro-f1 in a single class classification task). 

'The embeddings layer uses the complete training data to compute word vec- 
tors with 100 dimensions. The question instances are preprocessed by remov- 
ing all punctuation besides the apostrophe and converting all characters to 
lower case. For tokenisation, the texts are split on whitespaces. Since the input 
sequences to the embeddings layer need to be of equal length, we pad the sen- 
tences to a fixed length of 50 words by appending empty word tokens to the start 
of the sequence. On average, the question-item sequences contain 29 words, with 
a standard deviation of 16 words. Sequences longer than 50 words (8% of the 
question-item pairs, whereof 5096 are shorter than 60 words) are cut off at the 
end to fit the fixed input length. 


Using Text Structure. For this second approach, we used the structure of 
the question texts as input for our models. The idea behind this approach is 
the assumption of a dependency existing between the sentence structure of a 
question and the Information type. 
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Expecting the item text to provide valuable information for predicting the 
Information type through the text structure, we concatenated question text and 
item when an item was present. We extracted the structure from the otherwise 
unprocessed text by using a Part-of-speech (POS) parser to shallow parse (also 
referred to as light parsing or chunking) the question instances into a tree of 
typed chunks. From this we used the chunk types except for the leaf nodes 
(the POS tags) to define a feature vector where each component represents the 
number of occurrences of a specific chunk type. There are 27 different chunk 
types. 

For the actual parsing we choose the Stanford PCFG parser in version 3.9.2 
[19] as it is well-known and tolerant towards misspellings. However, some special 
cases in the phrasing introduce noise. Some expressions miss expressiveness as 
they refer to information presented in a previous question (“How is it in this 
case?” ) or in the answer categories (“Would you ...”). Furthermore, misspellings 
and similar errors introduce additional noise. Since the parser was able to provide 
a structure for all samples we did not have to exclude any samples. Leaving all 
5989 samples for use. 

We started testing using standard classifiers RandomForest (str.rf), Multi- 
nomial Naive Bayes (str. mnb), Linear Support Vector Machines (str.svc) and 
Logistic Regression (str. logreg) from the scikit-learn [22] library for Python. For 
each model we performed grid hyperparameter tuning on the training set with 5- 
fold cross-validation. We report parameters deviating from the default configura- 
tion. For str.svc we used C=0.5, max. iter—5000 and ‘ovr’=multi_class mode. For 
str rf n estimators—200, max features—3 and max. depth—50 was used. Again for 
str logreg we applied C=10 and max. iter—5000. Finally str. mnb was used with 
alpha=3. 


4.3 Evaluation Setup 


For evaluating, we employ five-folds cross-validation with 80% training and 20% 
test set split and use the manual annotations as ground truth. For the best 
performing approach for predicting Information types L1 and L2, we also present 
and discuss the confusion matrices. 


4.4 Results 


Table 1 displays the results for the L1 and L2 Information types. The first column 
states the name of the respective approach and model. The following two columns 
contain micro-fl and macro-fl for the L1 Information types and the remaining 
two columns do the same for the L2 Information types. 

As we can see in Table1, L1 seq.lstm has the highest micro-fl score with 
0.7640 followed by the group of str. -approaches which range between 0.5305 and 
0.6287. The macro-f1 follows the same pattern with seq.lstm at 0.7455 and the 
others again grouped together and more than 0.17 points beneath. This is similar 
for L2 where seq lstm again has the best micro-fl and macro-f1 scores at 0.4793 
and 0.482. 
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Table 1. Results of L1 and L2 Information type extraction 


Approach | L1 L2 

micro-f1 | macro-f1 | micro-f1 | macro-fl 
seq.lstm | 0.7640 |0.7455 |0.4793 |0.482 
str.svc 0.5437 | 0.524 0.3444 | 0.2610 
str_rf 0.6287 | 0.5751 0.4578 | 0.3754 
str_logreg | 0.5454 | 0.5329 0.3386 | 0.2844 
str.mnb | 0.5305 0.526 0.3377 | 0.2656 


Our anticipated usage scenario is a facetted search in a data search portal i.e. 
the GESIS Search. Here users will be presented the question features as facets 
and be allowed to use them to define their search request more precisely. Due 
to the infinite ways to formulate questions (and to specify classes), sometimes 
the assignment of a question to a class is ambiguous, also when done manually. 
Different users may associate a certain question with a different class and may 
still be correct. Thus, our intuition is that an F1 score of 0.7 could be counted 
as suitable. 

For L1 seq_Istm matches this goal. Also the str.-approaches are not out of 
range. However, results for L2 will need to be improved. Performance limiting 
factors may be low expressiveness of features and too similar classes. Given the 
high number of classes for L2 we are content with the models’ performances, 
however for the use case it might be better to merge some of the classes. For 
closer elaboration we present the confusion matrices in the Figs. 4 and 5. 


Fact Evaluation 


Cognition 


Evaluation Fact Cognition 


Fig. 4. Confusion matrix for LSTM classifier on Information type L1 


In the diagrams, the predicted classes are on the X-axis and the actual classes 
are on the Y-axis. Both confusion matrices show little mispredictions of “Fact” 
or “Fact”-subclasses. In contrast, “Evaluation” and “Cognition” get confused 
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Fig. 5. Confusion matrix for LSTM classifier on Information type L2 


more often. Especially in Fig. 5 “Assessment” (sub-class of “Evaluation” ) gets 
mispredicted as “Perception” (sub-class of “Cognition” ) and vice-versa. Also, a 
notable fraction of “Assessment” is confused with “Cognition_Rest”. Looking at 
the concerned classes’ labels, it is apparent that the concepts they represent are 
also for humans not easy to tell apart. 

To test for this we conducted a small experiment for inter-annotator agree- 
ment where we reannotated 200 of the samples through two extra annotators. 
It resulted in an average Cohen's « of 0.61 for L1 and 0.53 for L2 and Krip- 
pendorf’s a of 0.55 and 0.44. These values, except for x = 0.61, substantiate 
the notion that the task is even for humans not trivial. Which again indicates 
an indistinct design of the Information type classes, especially for L2. Suppos- 
edly a pilot study including multiple human annotators could help to define a 
clearer set of classes. However, classes should have intuitive denominations as 
complex artificial classes are hard to communicate to the users. Another way 
to overcome this could be to redesign the task as multi-class classification task. 
This however would come at the cost of simplicity for the user. Anyway, for this 
experiment the numbers show validity of our approach to a certain degree. An 
interesting question in this context will be to determine how the results change 
if the threshold for the confidence score for the inclusion of annotated questions 
into the dataset is raised. 

A few things that could be improved are e.g. the selection of features for 
the str.-approaches which is rather sparse at the moment i.e. the feature vector 
might not carry enough information for the classifiers. Hence, a solution could 
be to extend the feature selection by the inclusion of signal words. E.g. “think”, 
“find”, “believe” may indicate opinions. 
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Once there are more question feature extractions available these can be used 
as input for each other leveraging potential interdependencies between then, e.g. 
in “Fact” questions certain values for “Quantification” might be more likely. 
Following the thought the test structure approaches could potentially be reused 
to extract some of the remaining question features directly, e.g. Language tone, 
Language complexity or Focus. 

Str.* and seq.lstm approaches take different/complementary kinds of fea- 
tures into account. That is, str. * leverages solely the grammatical structure of a 
sentence, seq.lstm uses sequences of words. Thus, our intuition is that there is 
potential for a combination of them e.g. by using the predictions of both types 
of the classifiers as input into a meta-classifier. A closer analysis on the nature 
of mispredictions of the str_-classifiers will be conducted in this context. 


5 Conclusion 


We present an approach to support the search of social science survey data by 
defining and implementing methods to annotate survey questions with semantic 
features. These dimensions complement existing topic and keyword extraction 
and allow for a finer grained semantic description. 

We defined the dimensions as a taxonomy of question features (contribution 
1), and designed a data model to describe the annotated data with the dimen- 
sions and lifted it together with the variable descriptions to RDF for re-use in 
other use-cases (contribution 2). Eventually, we examined approaches to predict 
the first question feature, the Information type, by means of classification tasks 
and present word sequences in combination with LSTM as a promising way (con- 
tribution 3). However, we consider combining it with one of the text structure 
approaches in the future. 

Our question feature model offers many possibilities for applications. It is 
especially designed to be integrated in a facet filter scenario, but provides also 
multiple options for use in data linking, sharing and discovery scenarios. We 
target the GESIS Search https:/ /search.gesis.org for a possible deployment. It is 
an integrated search system allowing search of multiple resource types including 
“Research data", “Publications”, “Instruments & Tools", “GESIS Webpages”, 
“GESIS Library” and “Variables & Questions”. The current filter offers the 
facets year, source and study title for the category of “Variables & Questions”. 
These will be complemented with our Information type feature. Besides lower- 
ing the assessment times for searchers per study, it could also improve re-use 
frequency and findability especially for less known datasets. Accordingly, less- 
experienced users may find it easier to orient themselves. Given that an already 
annotated training set can be reused, data providers in turn benefit from reduced 
efforts in variable documentation since this can be done automatically. 

We are positive that there are additional use cases where a subset of our 
features can be reused to semantically describe textual contents. For example, 
short descriptions or titles of e.g. images can be annotated with the situation 
features. Also, language and knowledge features are applicable for these scenarios 
and can help to assess a text by getting to know the audience. 
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In future work, we plan to annotate and predict more features and fine tune 


the presented approach. Furthermore, a user study is planned to test for fitness in 
terms of (a) comprehensiveness of the facet and its values, (b) acceptance of the 
concept of the Information type and (c) trust in the accuracy of the annotation. 
A revision of the question feature design might still be necessary in order to fit 
user acceptance. 


Funding. This work was partly funded by the DFG, grant no. 388815326; the VACOS 
project at GESIS. 
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Abstract. While Open Data (OD) publishers are spur in providing data 
as Linked Open Data (LOD) to boost innovation and knowledge creation, 
the complexity of RDF querying languages, such as SPARQL, threatens 
their exploitation. We aim to help lay users (by focusing on experts in 
table manipulation, such as OD experts) in querying and exploiting LOD 
by taking advantage of our target users’ expertise in table manipulation 
and chart creation. 

We propose QueDI (Query Data of Interest), a question-answering 
and visualization tool that implements a scaffold transitional approach 
to 1) query LOD without being aware of SPARQL and representing 
results by data tables; 2) once reached our target user comfort zone, 
users can manipulate and 3) visually represent data by exportable and 
dynamic visualizations. The main novelty of our approach is the split of 
the querying phase in SPARQL query building and data table manipu- 
lation. 

In this article, we present the QueDI operating mechanism, its inter- 
face supported by a guided use-case over DBpedia, and the evaluation 
of its accuracy and usability level. 


Keywords: Knowledge graph - Query builder - Data visualization - 
SPARQL & SQL queries - Faceted search - Natural language queries 


1 Introduction 


Open Data (OD) providers mainly opt for publishing data by non-proprietary 
formats (such as CSV) [8]. As a publisher, it requires minimum effort due to 
the easiness of the data format, and, as a consumer, it provides free access to 
resources [3,4]. To fully benefit from OD, data should also provide their con- 
text to create new knowledge and enable data exploitation [3]. Therefore, data 
providers are strongly encouraged to move published datasets from 3-stars to 5- 
stars, i.e., to publish data in RDF format and interlink them to other resources to 
provide context [4]. 5-stars data are also referred to as Linked Open Data (LOD). 
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Among the several different definitions of a Knowledge Graph (KG), we adopt 
the definition according to a KG is achieved by attaching to LOD their schema 
(i.e., an ontology) [15]. LOD facilitate innovation and knowledge creation from 
the publishing perspective [3,4]. However, from the consumption point of view, 
LOD exploitation is threatened by the complexity of their querying languages. 
Even if SPARQL [22] has been recognized as the most common query language 
for RDF data, it proves to be too challenging, mainly for lay users [7,9]. 

'The problem we aim to solve is how to help potential users of the semantic 
web in easily accessing LOD (without requiring the explicit usage of SPARQL) 
and in exploiting the retrieved data. We aim to mainly focus on experts in data 
table manipulation and chart creation. It is not a strong limitation since many 
data visualization tools start from CSV files (or in general data tables). Thus, 
we can refer to our target users as experts in data table manipulation, and we 
aim to guide them in manipulating LOD through their tabular representation. 

We propose a transitional approach where users are guided from LOD query- 
ing to our target user comfort zone, i.e., a tabular representation of data, table 
manipulation, and chart generation. As a result, we implement this transitional 
approach in QueDI (Query Data of Interest) that allows users to build queries 
step-by-step with an auto-complete mechanism and to exploit retrieved results 
by exportable and dynamic visualizations. Users can query LOD without explic- 
itly creating SPARQL queries, and it is not required any previous knowledge of 
queried data. Users can inspect the nature of data by inspection, using natural 
language (NL) and query building. Query builders are about trading off usabil- 
ity of the proposed mechanism and its expressivity. We opt for a faceted search 
interface (FSI) enhanced by a NL query to extract results that reply to users’ 
requests and by modelling them as a table. By this approach, we cover Basic 
Graph Patterns (BGPs), such as path traversal, union, filters, negation, and 
optional patterns. The component that implements this approach (correspond- 
ing to the first step of our workflow) will be referred to as ELODIE (Extractor 
of Linked Open Data of IntErest) (pronounced elode). When users are satisfied 
with the retrieved results, they can move to the second step of our workflow, 
ie., the table manipulation, to perform aggregations, filtering, sorting; finally, 
they can represent knowledge by dynamic and exportable visualizations dur- 
ing the third and last step of our scaffolded approach. Therefore, by combining 
the expressivity of ELODIE and table manipulation, we cover SELECT queries 
which results can always be represented as a table, BGPs directly in SPARQL, 
sorting, GROUP BY, aggregation operators and filtering by table manipulation. 

The Research Questions (RQs) we aim to reply are: 


RQI1. Does the proposed approach lose in accuracy? We aim to compare our 
two-phase approach (SPARQL queries building and table manipulation) 
with the formulation of SPARQL queries only exploiting SPARQL query 
building. 

RQ2. Do lay users (users without technical skills in the Semantic Web technolo- 
gies) consider usable the ELODIE operating mechanism and its interface? 

RQ3. Are lay users able to quickly learn how to exploit ELODIE in retrieving 
data of interest? 
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The main contributions of this article are: 


— the proposal of a transitional approach to guide table manipulation experts 
in exploiting LOD by relying on their abilities in data manipulation and chart 
creation; 

— the implementation of the proposed approach in QueDI, a guided work- 
flow composed of 1) ELODIE, a SPARQL query builder provided on a FSI 
enhanced by a NL query to query LOD without explicitly using SPARQL; 
2) data table manipulation and 3) chart creation. 


The main novelties of our proposal are 1) the provision of a querying mechanism 
articulated in two steps: a SPARQL query building phase to retrieve results 
from LOD followed by a SQL building phase to manipulate retrieved results; 
2) a guided workflow from data querying to knowledge representation instead of 
the juxtaposition of visualization mechanisms to query builders. 

The rest of this article is structured as follows: in Sect.2, we overview related 
work on making semantic search more usable, and we mainly focus on the trade- 
off between usability and expressivity they propose; in Sect. 3, we present chal- 
lenges in querying LOD, the QueDI implementation overview, and a navigation 
scenario on DBpedia; in Sect. 4, we estimate the QueDI accuracy and expressivity 
by a standard benchmark dataset (QALD-9 on DBpedia) and its usability (also 
including temporal aspects); finally, we will conclude with some final remarks 
and future directions. 


2 Related Work 


During the past years, several different approaches have been proposed to hide 
the complexity of SPARQL and enable query building. Users can query KG by 
creating graph-like queries (such as FedViz [23], RDF Explorer [21]) or visual 
query formulation (e.g., OptiqueVQS [19]), they can interact with facets (e.g., 
SemFacet [2]), also enhanced by keyword search interfaces (such as SPARK- 
LIS [10] and Tabulator [5]), they can be helped by query completion (such as 
YASGUI [16]), users can work with summarization approaches (such as Sgvi- 
zler [18]), or a combination of them. The expressivity level of the querying 
method can be affected by the interaction model, the required usability, the effi- 
ciency. Some tools support users not only in retrieving data but also in visualizing 
them. We will focus on tools that combine data querying and visualization. 

In Tablel, we provide an overview of the considered tools by presenting a 
schematic comparison of query building mode, expressivity, and the need for 
SPARQL awareness by users. Moreover, we also consider the provided visualiza- 
tion mode, and if customization and export are enabled. 

Tabulator [5] leads to query (and modify) KGs without SPARQL awareness. 
Users can interact with an FSI where predicate/object pairs are reported for each 
focused element, and the user can recursively follow paths by choosing element 
by element. Besides the tabular representation of retrieved results, Tabulator 
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provides basic visualizations: if results contain temporal or geographical infor- 
mation, the user can create timelines and/or maps. It is not mentioned if the 
realized visualization can be customized and/or exported. 


Table 1. Comparison of interfaces to query KGs and visualize the retrieved results. For 
each work we report 1) the year of publication, 2) the interaction mode, expressiveness 
and the awareness of SPARQL for the Query builder, 3) visualization mode and the 
possibility to customize and export the visualization. ~ means that the feature is 
partially covered; empty cells mean that the feature is not clarified by the author(s). 


Tool Year | Query builder Visualization 
Mode Expressivity SPARQL Mode Custom | Export 
awareness 

TABULATOR [5] |2006 | facet Path Traversal) x time, map 

NITELIGHT [17] | 2008] graph SPARQL 1.0— |~ time, map 

VISINAV [12] 2010 |facet4- keywords|BGPs x time, map "4 

Sgvizler [18] 2012 |text SPARQL r4 Google Charts "4 

VISU [1] 2013 |text SPARQL r4 Google Charts 

Visualbox [11] 2013 text SPARQL "4 chart, map, time |v "4 

Rdf:SynopsViz [6] | 2014 | form BGPs— x chart, treemap, x v 
time 

YASGUI 2017 | text SPARQL 1.1 |v Google Charts Y Y 

SPARKLIS [10] 2018 |facet-- NL SPARQL— x Google Charts + |v v 
map, image 

WQS [14] 2018 | form BGPs x chart, map, time, |v Y 
image, graph 

QueDI 2020 |facet-- NL BGPs+ x chart, time, r4 "4 
image, map 


NITELIGHT [17] is a tool to create graphical SPARQL queries. Authors 
declare that it is intended for users that already have a SPARQL background 
since the complexity and the structures of SPARQL patterns are not masked 
during the query definition. A keyword browser supports the query formulation 
to lookup classes and properties of interest. The output of the query can be 
visualized as a map and/or timeline. It seems that the resulting visualization 
can neither be customized and exported. 

VISINAV [12] leads users in looking up for a keyword of interest, without 
knowing the underlying data modelling. The keyword is literally searched into 
the KG, without extending it with synonyms and related terms. Starting from 
retrieved results, the user can follow paths and select facets to manipulate and 
extend the result set. Furthermore, VISINAV supports basic temporal and spa- 
tial visualizations. While the export seems to be provided, it is not clarified if 
the customization can be performed. 

VISU [1] and Sgvizler [18] are both query builders and data visualization 
tools. Users can interact with a single or multiple SPARQL endpoints by directly 
using SPARQL (therefore users are SPARQL aware), manipulate the resulting 
table, and create customizable and exportable visualization by Google Charts. 
While Sgvizler is general-purpose, VISU is bound for university data. 
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Visualbox [11] is an environment to query KGs by SPARQL and view results 
by a set of visualization templates (called filters). These filters can be downloaded 
and wrapped in other hyper-textual documents, such as blogs or wikis. 

rdf:SynopsViz [6] provides faceted browsing and filtering over classes and 
properties inferring statistics and hierarchies from data without requiring any 
further interaction by the user. Once data have been retrieved, users can visualize 
them by charts, treemaps, timelines according to data and needs. Visualization 
can be exported, but not customized by the user. 

YASGUI [16] guides users in querying KGs by directly using SPARQL and 
visualize data through Google Charts. The query builder is enhanced by auto- 
completion, while the integration with Google Charts provides customizable and 
exportable visualizations. 

SPARKLIS [10] is a query builder based on a faceted search and a natural 
language interface. Within the tool, it offers basic visualizations, such as maps 
and image viewers. Furthermore, it is integrated with YASGUI and, thus, it 
inherits its visualization approach. Unlike YASGUI, it can mask the complexity 
of SPARQL, without losing its expressiveness. 

Wikidata Query Service (WQS) [14] is bound for Wikidata; it leads to the 
creation of queries by a form-based interface, and it provides several different 
visualization modes, such as charts, maps, timelines, image viewers, and graphs. 

Our proposal, QueDI, is a guided workflow from KG querying to data visu- 
alization. Users can query LOD by FSI enhanced by an NL query. The inter- 
face masks an automatic and on-the-fly generation of SPARQL queries. By only 
considering the SPARQL query generation phase, we cover BGPs. By also con- 
sidering the dataset manipulation phase, we cover aggregation and sorting. This 
consideration justifies that the expressivity of QueDI is more than BGP. Finally, 
customizable and exportable visualization can be created. Users can export the 
visualization as an image or as a dynamic and live component that can be embed- 
ded in any hyper-textual page, such as HTML pages, WordPress blogs and/or 
Wikis. 

The main difference with the previous works is the split of the expressivity 
of the query building phase in an implicit creation of SPARQL queries over KGs 
and by direct manipulation of datasets to perform aggregation and sorting. 


3 QueDI: A Guided Approach to Query and Exploit LOD 


3.1 Linked Open Data Querying Challenges 
The main challenges posed by querying LOD are: 


— technical complexity of SPARQL: SPARQL is extremely expressive but 
writing SPARQL queries is an error-prone task, and it is largely inaccessible 
for lay users; 

— hard conceptualisation: data can be modelled by domain-specific schema, 
or they can be domain-agnostic. Therefore, it may not be easy to conceptu- 
alize the data that users are querying; 
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— heterogeneity in data modelling: this issue is strongly related to the difficul- 
ties in conceptualization. Since different endpoints can use different vocabu- 
laries and ontologies, it is hard to figure out the terminology to use in posing 
questions; 

— scalability to manage (potential) huge amount of data; 

— portability to different endpoints; 

— readability of queries and retrieved results; 

— intuitive use in deriving results by few and clear clicks. 


By overviewing the QueDI features and its operating mechanism, we will point 
our solution to these challenges. 


3.2  QueDI Overview 


In this section, we present the QueDI system, whose goal is to enable lay users 
with a background in data table manipulation to query KGs and visualize the 
retrieved results. To guide users in the entire workflow, we split the querying and 
exploitation process into three steps (Fig. 1). Each step has a clear objective, and 
we aim to guarantee few and clear interactions a time to provide an intuitive 
use. The implemented steps of our scaffold approach are: 


Dataset creation the user starts from a SPARQL endpoint and can query 
the KG. This step aims to create the dataset of interest, i.e., a dataset that 
replies to the question of interest, without requiring any expertise in SPARQL. 
ELODIE implements this phase. By representing the SPARQL query results 
as a data table, we move from LOD to the conform zone of the experts in 
data manipulation. It represents the transitional approach from LOD to data 
table representation. 

Dataset manipulation when the user is satisfied with the retrieved data, 
he/she can start the manipulation of the dataset to refine its information 
and to make it compliant with the desired visualization. In this stage, we 
exploit the skills of our target in data table manipulation: users can refine 
results, aggregate values, and sort columns. The goal of this step is to clean 
the data table and make it compliant with the visualization requirement. 

Visualization creation (exportable and reusable) visualizations can be created 
and customized. This step realizes the immediate gratification for informa- 
tion consumers of seeing the results of their effort in a concrete artefact. We 
provide the customization based on the user's preferences and the export to 
enable the reuse also out of QueDI. 


'The entire workflow implemented in QueDI takes place on client-side, without 
any server-side computation. QueDI is released open-source on GitHub!. To see 


1 QueDI on GitHub: https://github.com/routetopa/deep2-components/tree/master/ 
controllets/splod-controllet. 
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ELODIE Dataset Visualization 
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Fig. 1. It represents the QueDI guided workflow into three phases: the SPARQL query 
building implemented by ELODIE to query KGs and organize results by a tabular 
format; the dataset manipulation to refine the table and the visualization creation 
where the acquired knowledge is graphically represented. 


how QueDI works, you can access to the online demo?. Quick tutorials? are 
available on YouTube. 


ELODIE - Dataset Creation Phase. ELODIE is a SPARQL query builder 
provided by an FSI and enhanced by an NL query. First, the user has to select 
the endpoint of interest among the provided suggestions. The supported end- 
points, at the moment, are DBpedia?, also the Live version?, the French end- 
point Persée?, the Italian endpoint Beni Culturali’, and the Chilean endpoint 
National library of Chile®. By default, ELODIE will query DBpedia. Then, we 
can move to the querying phase. 

Figure2 represents the operating mechanism of ELODIE: the user query and 
the focus determine the state of the system. The NL query represents the user 
query, therefore herein NL query and user query will be used as synonyms. While 
the user query represents the query under construction, the focus represents the 
insertion position for applying query transformation. According to the focus, 
concepts (i.e., classes), predicates (i.e., relations) and resources are retrieved 
from the endpoint and organised in facets (also referred to as tabs). More in 
detail, all the sub-classes that can refine the focus are listed in the classes tab; 
all the predicates that have the focus as subject (direct predicates) or as the 
object (reverse predicates) are listed in the predicate tab. 

Users can go on in the query formulation by selecting any element listed in 
the tabs. Therefore, the user query is iteratively defined, and the content of the 
queried source is discovered by inspection. It solves the problem of conceptualised 
data since users have access to valid options (referred to as suggestions) without 
explicitly asking for them. Suggestions are retrieved by path traversal queries, 


? Online demo of QueDI:  https://deep.routetopa.eu/deep2/COMPONENTS/ 
controllets/splod- visualization-controllet /demo.html. 

3 YouTube tutorials of QueDI: https://youtu.be/e.032GP-1l1c. 

^ https: //dbpedia.org/sparal. 

5 http://live.dbpedia.org/sparql. 

6 http:/ /data.persee.fr/sparql. 

7 http: //dati.culturaitalia.it /sparql. 

8 https: //datos.ben.cl/sparql. 
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generic enough to be used to retrieve data from any endpoints, by solving the 
portability issue. At each query refinement, the map that models user interactions 
is updated by modifying the focus neighbourhood. Then, by a pre-order visit of 
the map, both the NL and the SPARQL queries are generated. While the NL 
query will be used to verbalize the user’s interactions, the SPARQL query will be 
posed against the SPARQL endpoint to retrieve the user query’s results. Once 
retrieved, results are organized by a tabular view. The last selected element 
behaves as the new focus, and, according to it, all the facets are consistently 
updated by querying the endpoint. This process is repeated to each user selection. 


| | updated updated updated 
| NL query NLquery focus focus tabcontent tab content 
7 — update uer 

focus / is ae 
— — | ne preorder 

Ei | EET | S | | nen \ visit 
tabs | selection 
| updated * SPARQL query updateg 
results 


ELODI query map results 
retrieval 


Fig. 2. Operating mechanism of ELODIE. Starting from a user’s selection, first, the 
map that models the user query is updated, and then, by a pre-order visit of this tree, 
both the NL and the SPARQL queries are generated. When the NL query is updated, 
also the related box in the ELODIE interface will be updated. The focus will reflect 
the last added element. According to the focus, ELODIE updates the tab content by 
querying the SPARQL endpoint. About the SPARQL user query, when the results are 
retrieved, the results table is updated. 


Thanks to the FSI, users are guided step-by-step in the query formulation. 
At each step, ELODIE provides a set of suggestions (concepts, predicates, oper- 
ator, results) to go on in the query formulation by preventing empty results. 
A clarification is needed: empty results are a real desired results in a complete 
KG interpreted as a close world. Since common KGs are usually incomplete, 
empty results can be interpreted either as a real desired result or as missing 
information. As we can not automatically distinguish them, we prevent empty 
results by providing all the navigable edges outgoing from the focus as a sugges- 
tion. In other words, suggestions are focus-dependent. This exploratory search 
provides an intuitive guide in query formulation. Once a suggestion has been 
selected, it will be incorporated and verbalized into the current user query. It 
makes ELODIE a Query Builder, without asking for the SPARQL knowledge. 
SPARQL is completely masked to the final users by providing a solution to the 
technical complexities of SPARQL. 

The query, suggestions, and results are verbalized in NL to solve the readabil- 
ity issue. Therefore, instead of showing URIs, we retrieve resources label. Class 
and predicate labels are obtained by looking for rdfs:label predicate attached 
to the retrieved results and by asking for the label in the user language. If these 
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labels are missing, ELODIE looks for the English label. If also this attempt fails 
and resources are not attached to rdfs:label, the URL local names are exploited as 
labels. Suggestion labels are contextualized by phrases. For instance, instead of 
showing author as a predicate, the predicate label is wrapped into a meaningful 
phrase, such as that has an author. The user query always represents a com- 
plete and meaningful phrase. Therefore, ELODIE is a kind of NL interface. How- 
ever, it is worth to notice that users cannot freely input the query, but ELODIE 
is provided with a controlled NL query used to verbalize the iteratively created 
user query. It makes query formulation less spontaneous and slower instead of 
directly writing the query in NL, but it provides intermediate results and sug- 
gestions at each step, prevents empty results, and avoids ambiguities issues of 
free-input NL query and out-of-scope questions. Queries and suggestions can be 
verbalized in English, Italian, and French, and new supported languages with 
the same syntax (such as Spanish) can be easily incorporated. 

Only a limited number of results and suggestions are retrieved to address 
scalability issues. However, this limit can be freely changed by users. The main 
drawback of limited suggestions is that it can prevent the formulation of some 
queries. Therefore, we propose an intelligent auto-completion mechanism at the 
top of each suggestion list. At each user keystroke, it filters the corresponding 
suggestion list for immediate feedback. If the lists get empty, the list of sugges- 
tions is re-computed by asking suggestions that include the user filter. 

To promote portability, ELODIE is entirely based on Web standards: the 
entire application is written in Javascript and the interface with HTML/CSS, 
with zero configuration. It only requires Cross-Origin Resource Sharing (CORS) 
enable SPARQL endpoint URL, i.e., a specification that enables truly open 
access across domain-boundaries. 

As already stressed, ELODIE enables the formulation of SELECT queries 
(that enables the provision of results in a tabular format) by covering BGPs. 


Dataset Manipulation. This phase implements a SQL query builder pro- 
vided by a form-based interface. Users can select columns of interest, perform 
aggregation, filtering, sorting. Data manipulation is enabled by a form-based 
interface where users can choose the column to affect, the operation of interest 
(such as group by or filters), and complete it by the required parameter(s). For 
instance, he/she can ask for removing empty cells from a column, remove all 
values but numbers, filter a column by number or string operations, group the 
table by column values, aggregate values by counting or summing them, com- 
puting the average or detecting the minimum or maximum value. The sorting is 
intuitively enabled on the top of each column. These patterns enhance the BGPs 
of ELODIE. By aggregation we mean that users can perform group by and com- 
pute statistics of retrieved data, such as count, average, sum. By filtering, we 
mean that users can remove empty cells or remove cells according to textual and 
numeric filters, such as contains for strings and less than for numbers. By 
each user interaction, a SQL query is automatically created to update the result 
table. In this step also, the query formulation is completely masked to the user. 
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Visualization Creation. This step implements the exploitation phase, where 
users are guided in representing the acquired knowledge by charts. Besides 
proposing the realization of mere images, we realized a mechanism to produce 
dynamic artefacts that can be embedded in any blog, web page as an HTML5 
component. Instead of wrapping the dataset in the chart, we embed the query 
to retrieve and refine the dataset in the representation. It always ensures up to 
date results. Therefore, if data in the queried endpoint change, also their visual 
representation will change as well. According to the guidance principle, users 
are provided with a vast pool of charts, such as timelines, maps, media-players, 
histograms, pie charts, bar charts, word clouds, treemaps. Only charts compliant 
with the provided data will be enabled. According to the chosen visualization 
mode, users can customize both the chart content and its layout. Then, the 
realized chart can be download as an image or as a dynamic component. 


3.8 Navigation Scenario on DBpedia 


We detail a navigation scenario using QueDI on DBpedia. Table 2 contains iter- 
ative queries as verbalised by QueDI of a navigation scenario that retrieves the 
geographical distribution of the Italian architectural structures. At each step, the 
bold part represents the last suggestion selected by the user and the underlined 
part represents the query focus. Suggestions can be classes (e.g., city), direct 
and inverse properties (respectively, has a thumbnail and is the location), 
operators (e.g. that is equals to) and resources (e.g., dbr: Italy’). 

Fig. 3 is a collage of screenshots of the different steps of the QueDI workflow. 
On the top (Fig. 3.1), there is the user query at the end of its formulation by 
ELODIE. The focus is highlighted in yellow in the user query, and it is verbalized 
below the user query. When the user is satisfied with the retrieved results, he/she 
can move to the second step, i.e., the dataset manipulation (Fig. 3.2). In this step, 
we group data by city and count the architectural structures in each group. In 
other words, we perform data aggregation. We also sort data by the number of 
structures. Now, we are ready to visualize the retrieved results and represent 
the achieved knowledge by an exportable visual representation. The third part 
(Fig. 3.3) represents the geolocalized distribution of architecture structures on 
the Italian map. 


4 Evaluation 


4.1 Accuracy, Expressivity and Scalability over QALD-9 


In this section, we evaluate the accuracy, expressivity, and scalability of QueDI. 
As stated before, we split the query formulation into two phases, i.e., a SPARQL 
query generation to retrieve results of interest and a SQL query generation to 
aggregate and sort results. Thus, we want to verify if (and in which cases) the 
accuracy is compromised. We hypothesize that the accuracy is affected only when 


? dbr is the prefix corresponding to http:/ /dbpedia.org/resource/. 


80 R. De Donato et al. 


the complete set of query results is so huge that the queried endpoint does not 
return all the results or our platform can not manage them. We want to assess 
the expressivity level by testing QueDI on standard benchmark for question 
answering and its scalability when tested against real KGs, such as DBpedia. 


Table 2. A navigation scenario in ELODIE over DBpedia. Underlined words represent 
the focus, while phrases in bold represent the last selected suggestion. 


Step Query 
a Give me something 


Give me a city 


Give me a city that is the location of something 


Give me a city that is the location of a place 


Give me a city that is the location of a place that is an architectural structure 


cO ctu Co ny 


Give me a city that is the location of a place that is an architectural structure 
that has a lat 
7 Give me a city that is the location of a place that is an architectural structure 


that has a lat and that has a long 


8 Give me a city that is the location of a place that is an architectural structure 
that has a lat and that has a long 
9 Give me a city that is the location of a place that is an architectural structure 
that has a lat and that has a long and that has a thumbnail 
10 Give me a city that is the location of a place that is an architectural structure 
that has a lat and that has a long and that has optionally a thumbnail 
11 Give me a city that is the location of a place that is an architectural structure 
that has a lat and that has a long and that has optionally a thumbnail 
12 Give me a city that is the location of a place that is an architectural structure 
that has a lat and that has a long and that has optionally a thumbnai 


and that has a country 


13 Give me a city that is the location of a place that is an architectural structure 


that has a lat and that has a long and that has optionally a thumbnai 


and that has a country that is equals to http://dbpedia.org/resource/Italy 


Dataset. We tested QueDI, mainly focusing on ELODIE and the data manipu- 
lation phase, on the QALD-9 challenge dataset!?. T'his dataset behaves as bench- 
marks in comparing NL Interfaces. We took into account the QALD-9 DBpedia 
multilingual test set!!. For each of the 150 testing questions over DBpedia, it 
contains the English (among the multi-language options) verbalization of each 
question, the related SPARQL query, and the collection of results. 


1? QALD-9 challenge https: / /project-hobbit.eu/challenges/qald-9-challenge/. 
11 QALD-9 dataset https://github.com/ag-sc/QALD/blob/master/9/data/qald-9- 
test-multilingual.json. 
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Experiment. We evaluated the minimum number of interactions and the related 
needed time starting from the empty query (i.e., Give me something). Since we 
aim to assess the accuracy of our two-step querying approach, the expressivity 
of QueDI, and the scalability on real datasets and not the usability, we aim to 
minimize the exploration and thinking time required by users to conceptualize 
queries. Thus, we both consider the English NL formulation of the query and the 
related SPARQL query while performing them on QueDI. The measured time 
represents the best interaction time for a trained and focused user in performing 
questions on QueDI. In real use, interaction time will increase according to unfa- 
miliarity with QueDI and the queried dataset and lack of focus in exploratory 
search. We will consider usability and interaction time in Sect. 4.2. 


Give me 1000 = city 
that is the location of a place that is an architectural structure 


that has a lat 
and that has along 
and that has optionally a thumbnail 
and that has a country that is equals to http-//dbpedia org/resource/Italy 


Focus: 1st city 
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Fig.3. The 1st component represents the user query of our navigation scenario as 
formalized by ELODIE to retrieve all the Italian architectural structures and related 
geographic information; the 2nd component represents the aggregated version of the 
results table to obtain the geographical distribution of the architectural structures; 
while the 3rd component reports the visual representations of the geographical distri- 
bution of structures on the Italian map. (Color figure online) 


Results. We estimate accuracy, precision and F-measure, both for each ques- 
tion (macro-measure) and for the entire dataset (micro-measure). In Table3, 
we report achieved results. The actual code used for the comparison and the 


82 R. De Donato et al. 


results are provided on GitHub!?. The challenge report [20] contains also results 
achieved by participants, that can be used for tool comparison. 


Table 3. It reports the micro and macro precision, recall and F1 score obtained by 
testing QueDI on QALD-9 testing dataset. 


Precision Recall [FI | [Precision Rcall[Fi 


Macro results | 0.75 0.83 | Micro results | 0.92 (0.93 0.92 


Expressivity. With QueDI, we can answer 143/150 questions. Not supported 
patterns cause the failures, i.e., make computation by SPARQL operator (3/7 
cases), field correlation and not exist), and, in 2/7 cases, too many results. 


Accuracy. In 20/143 cases, we both exploited ELODIE expressivity and data 
manipulation features. By considering the queries that requires further refine- 
ment, sorting or aggregates, we observe that: in 8/20 cases we perform a group 
by to remove duplicates; in 4/20 cases we perform group by, count as aggre- 
gation and sort; in 8/20 cases only sorting is required. It is worth to notice 
that ELODIE returns the count of table tuples without requiring any further 
interaction. Only one failure is caused by a too wide pool of results (all books 
and their numbers of pages) that QueDI is not able to manage. In conclusion, 
we can consider that by splitting the querying phase into two steps, we only lose 
accuracy when the desired query is too wide and/or the desired results are too 
much to be first collected and then refined (RQ1). 


Scalability. By considering the interaction time for the 143 successful questions, 
we observe that: more than half of the questions (75/143) can be answered in 
less than 40s (with 30 s as median and average time); 115 of them can be replied 
in less than 60 s (with 0,4 as average time and 0,37 as median time); only 6 of 
them requires a time that lies between 2 min and 3 min and a half (median time 
40 s and average time 60 s). 


4.2 Usability 


We estimate the usability and the execution time in real use by providing a list 
of tasks to inexperienced participants and by collecting results of a standard 
questionnaire (we used SUS [13]) to assess the system usability (to reply to 
RQ2) and by comparing the needed time of lay users with the execution time of 
focused expert in accomplishing the same tasks (to reply to RQ3). Besides SUS, 
we also ask participants to provide subjective perception of the complexity and 
usability of QueDI, by estimating the perceived complexity in replying questions 
by QueDI, if the few indications provided by the training phase enable them to 
effectively using ELODIE, and if the effort and needed time to interact with 
QueDI is reasonable. 


12 bttps://github.com/mariaangelapellegrino/QueDI evaluation. 
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Sampling. The users involved in the testing phase are 23 in total: 11 with skills 
in computer science and dataset manipulation (we involved both students still 
studying and already graduated) and 12 lay users, without any technical skill in 
querying language and heterogeneous background. 


Experiment. We structured the evaluation as follows: 


— we performed 15min of training to provide users with the opportunity to 
become familiar with QueDI (in particular with ELODIE) and the queried 
data by performing guided examples and by answering to queries of incre- 
mental complexity. All users were not aware of QueDI in advance; 

— testing phase: six tasks (Table 4) are submitted in the Italian language asking 
for the use of DBpedia. The tasks are of incremental complexity, as for the 
training phase. For each task, the user reported the completion time and 
filled in an After Scenario Questionnaire (ASQ) using a Yes/No answers to 
evaluate 1) the degree of the perceived difficulty of the task by performing it 
through ELODIE, 2) if the time to complete the task is reasonable, 3) if the 
provided knowledge in the training phase is sufficient to complete the task. 

— in conclusion, we asked for the fulfilment of a final questionnaire to evaluate i) 
the user satisfaction based on a Standard Usability Survey (SUS [13]) and ii) 
the interest in using and proposing the tool by a Behavioural Intentions (BI) 
survey. The questions of the BI survey are: i) “J will use the system regularly in 
the future" ; ii) “I will strongly recommend others to use the system” and users 
can use a 7-point scale to reply. In the end, the participants in the evaluation 
study were free to suggest improvements, report the main difficulties, and the 
strengths of QueDI as open questions. 


Table 4. Tasks provided during the evaluation of the usability of QueDI. 


Task # | Query 

Task 1 | The Italian museums 

Task 2 | The games with at least 2 players 

Task 3 | The presenters who are the presenter of a TV Show 
Task 4 | The female scientists born and dead in Germany 
Task 5 | The athletes which are not dead 

Task 6 | The artists born in the same place of an athlete 


Usability. The SUS score is 70 for the first group and 68 for the second one. 
According to the SUS score interpretation, all the values at least equal to 68 
classify the system as above the average. That means that QueDI is considered 
usable both for technical and lay users (RQ2). In the open questions, it is clear 
that the perceived usability is closely related to the training phase: users - espe- 
cially not experienced ones - need initial training to get familiar with KGs and 
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their modelling. About BI, both the groups reached an average score of 5 in both 
the questions, i.e., there is an overall intention to reuse and propose ELODIE. 


Execution time. For each group, we consider the execution time compared to 
the time needed to one expert of the field (also familiar with QueDI) - hence 
called optimal value. The results related to the first group - the Computer Science 
experts - are reported in Fig. 4a. While the results related to the second group 
- the lay users - are reported in Fig. 4b. 
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a) Execution times of the computer science group. b) Execution times of the lay users. 


Fig. 4. The time is reported on the y-axis, the tasks on the x-axis. The square icons 
represent the average score. The black dots represent the optimal value. The grey 
diamonds represent outliers. 


In all the queries - but the last query for the second group - the minimum 
time needed by the participants either matches the optimal one or it is even 
better. It is a surprising result, and it means that there are users (at least one in 
each group) able to get familiar with QueDI and learning how to use it in a short 
time (RQ3). About the outliers, in the open questions, it is evident that the main 
difficulties are in “finding the exact way to refer to an asked predicate or concept” 
(reported by 6 out of 23 users). The participants suggest to “insert inline help, 
tool-tips to help the users during the usage, examples of usage" (reported by 9 
out of 23 users). The start is considered a small obstacle to face: *After a bit of 
experience, the system is pretty easy to use” (reported by 6 out of 23 users). 


5 Conclusion and Future Work 


In this article, we present a transitional approach to bring closer the Semantic 
Web technologies and the community of tabular data manipulation and repre- 
sentation by enabling querying and visualization of LOD. We implement the 
proposed approach in QueDI, a guided workflow from data querying to their 
visualization by dynamic and exportable data representations. We propose to 
split the querying phase in SPARQL queries building and data table manipula- 
tion and we loose in accuracy only when results are too much to be first retrieved 


QueDI: From Knowledge Graph Querying to Data Visualization 85 


and then filtered (RQ1). The 70 score according to the SUS questionnaire reports 
that QueDI is considered usable by lay users (with and without table manipula- 
tion skills) (RQ2). The needed time by users with computer science background 
to interact with ELODIE is almost indistinguishable by the execution time of 
focused users, experts in QueDI features (RQ3). 


Future Work. The described evaluation is a preliminary experiment to assess 
QueDI performance. We are defining a comparison between QueDI and state of 
the art. We aim to enrich the proposed endpoints by also considering the integra- 
tion of a proxy to overcome the issue of not CORS-enabled endpoints. Moreover, 
we aim to further simplify the exploratory search in retrieving suggestions by 
also considering synonyms and alternative forms of the queried keywords. 
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Abstract. A key source of revenue for the media and entertainment 
domain is ad targeting: serving advertisements to a select set of visitors 
based on various captured visitor traits. Compared to global media com- 
panies such as Google and Facebook that aggregate data from various 
sources (and the privacy concerns these aggregations bring), local compa- 
nies only capture a small number of (high-quality) traits and retrieve an 
unbalanced small amount of revenue. To increase these local publishers’ 
competitive advantage, they need to join forces, whilst taking the visi- 
tors’ privacy concerns into account. The EcoDaLo consortium, located in 
Belgium and consisting of Adlogix, Pebble Media, and Roularta Media 
Group as founding partners, aims to combine local publishers’ data with- 
out requiring these partners to share this data across the consortium. 
Usage of Semantic Web technologies enables a decentralized approach 
where federated querying allows local companies to combine their cap- 
tured visitor traits, and better target visitors, without aggregating all 
data. To increase potential uptake, technical complexity to join this con- 
sortium is kept minimal, and established technology is used where pos- 
sible. This solution was showcased in Belgium which provided the par- 
ticipating partners valuable insights and suggests future research chal- 
lenges. Perspectives are to enlarge the consortium and provide measur- 
able impact in ad targeting to local publishers. 


Keywords: Advertisement - Federation * Linked Data 


1 Introduction 


Digital advertising is the act of serving advertisements (“ads”) in different for- 
mats to visitors who consume online content on publishers! websites. It is a key 
source of revenue in media and entertainment domain: advertisers that set up 
an ad campaign receive revenue from the company ordering the campaign, and 
publishers receive money from advertisers to display ads. When setting up an 
ad campaign, advertisers specify which and how many ads are served (from one 
or more companies) as well as its format. 

In ad targeting, an advertiser also defines a pre-selected set of visitors based on 
various traits, e.g. geography, demographics, psychographics, browsing behavior, 
© The Author(s) 2020 
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or past purchases. Ad targeting increases the probability of a visitor reacting 
positively compared to serving the same ad to every visitor [30], and, thus, 
results in higher return on investment for both publishers and advertisers. 

The profile, the trait set of a visitor, needs to be captured, using observations 
via various complementary channels. For example, when Alice visits the sports 
page of a publisher’s website more than eight times per month, that publisher 
— or a third-party tracker — adds the trait “liking sports" to Alice’s profile (web 
browsing behavior observation). When Alice registers herself on that website and 
enters her birth date, her age range trait (e.g., 40-55) is also added to her profile 
(demographics observation). Alice can be targeted by the profile “People over 
85 years old liking sports”, as her profile matches, as long as sufficient consent 
was provided upfront by Alice. When more traits of Alice are captured, she can 
be targeted by more (and more specific) ads. 

However, profile data, and the revenue they entail, are unevenly dis- 
tributed [3]: it was predicted that, in the first quarter of 2016, 85% of online 
advertising spendings would go to either Google or Facebook [14]. Such global 
publishers are media conglomerates and track visitors far beyond their own 
media properties. It is estimated that at least 68% of the most popular web- 
sites are tracked by Google [10]. These companies aggregate and centralize a 
large amount of data, and enable advertisers to create rich profiles. In contrast, 
local publishers hold only a fraction of visitor traits, as found on their own web- 
sites. Those traits are typically of higher quality compared to global companies, 
as local publishers have a closer relationship with their visitors. However, local 
publishers typically miss the opportunity to target visitors matching a requested 
profile, due to lack of scale, and, hence, miss out on revenue. 

Combining multiple local publishers’ data can improve the profiling infor- 
mation and make their generated profiles — due to higher quality — competitive 
to global publishers. However, aggregating and centralizing all data understand- 
ably comes with limitations. Recent large-scale data scandals made the general 
public increasingly aware of the importance of privacy and control over personal 
data. The introduction of the General Data Protection Regulation (GDPR) in 
the European Union [9] enforces explicit, freely-given consent for sharing per- 
sonal data. More, sharing all data across publishers would not be well received 
by the publishers, as this would result in loss of competitive advantage. The data 
should thus remain exclusive to each publisher. 

Using federated querying the data remains spread among — and under con- 
trol of — publishers. However, it allows discovering visitors that adhere to a 
certain targeted profile, combining the relevant data from multiple publishers 
via federated querying. Linked Data [1] acts as an enabling technology: (i) the 
interoperable layer allows uniform and unambiguous trait descriptions across 
publishers and (ii) richer profiles are created via federated querying, while the 
data does not need to be shared across publishers. The usage of semantic tech- 
nologies, thus, allows local publishers to join forces, leveling the playing field 
with global companies. Local publishers and advertisers do not need to fully 
share their data, whilst improving ad targeting. 
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A solution based on federated querying is devised mapping publisher’s custom 
trait definitions to a common SKOS vocabulary [20], generating RDF datasets 
using RML [7] and queried using Comunica [26]. This solution is applied to 
and deployed in the media landscape of Flanders, Belgium, as it is explained 
at https://vimeo.com/374617281. A consortium was formed, dubbed EcoDaLo, 
consisting of complementary partners to deploy this interoperable layer: Adlogix, 
Pebble Media, and Roularta Media Group. 

We present the role semantic technologies play in EcoDaLo, allowing feder- 
ated advertisement targeting in Belgium. After introducing the use case (Sect. 2), 
we present our application (Sect.3). Our approach was showcased by multi- 
ple companies in Belgium (Sect.4), allowing federated integration of traits to 
improve targeting across local publishers. We functionally evaluate our solution 
(Sect. 5), present related work (Sect. 6), and conclude by discussing privacy and 
ethical considerations as well as key features of our solution (Sect. 7). 


2 EcoDaLo 


The EcoDaLo consortium is one of the first collaborations where publishers 
remain in exclusive control of data they collected, and a decentralized deploy- 
ment is attained. Three complementary funding consortium partners participate 
in EcoDaLo. AdLogix is a development company experienced in digital advertis- 
ing, which developed multiple advertising products on the international market!. 
It is responsible for providing technical support to build a production-ready sys- 
tem that can be used by both advertisers and publishers. Pebble Media is a 
digital sales house, representing the role of advertiser, with many partnerships in 
the local market?. Roularta Media Group is a multimedia group, represent- 
ing the role of publisher, and market leader in the field of radio and television, 
magazines, and local media in Flanders?. As domain experts, Pebble Media 
and Roularta Media Group are responsible for providing technical requirements, 
aligned with the current advertising industry landscape. As all bases are covered 
by the different consortium partners, the devised solution remains in line with 
industrial perspectives, and chances of successful impact increases. 


Motivating Example. Alice visits the websites of publishers A and B (Fig. 1). The 
publishers have different ways of identifying Alice’s traits. Publisher A knows 
her age range because Alice registered her birth date: Alice is identified with id 
A123 and she gets assigned trait A_over_35 (Fig. 1, 1). Publisher B — specialized 
in football content — deduces that Alice is football lover, because she visits any 
of the publisher's web pages more than once a week: Alice is identified with id 
B456, she gets assigned trait B likes football (Fig. 1, 2). None of the publishers 
can provide enough traits to match Alice to “sports lovers over 35" (Fig. 1, 3). 
And even if publishers could combine their user traits, it is not clear whether 


1 http://www.adlogix.eu/. 
? http:/ /www.pebblemedia.be/. 
3 https:/ /www.roularta.be/en. 
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a football lover qualifies as a sports lover or not. EcoDaLo aims to enable this 
potential, semantically — i.e., meaningfully — combining the captured data, and 
serve Alice relevant advertisements, targeted at the requested profile. 
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2E Visit football page 1, 2, 3 
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Fig.1. Publisher A knows Alice is over 35 years old, and Publisher B that she likes 
football. However, Alice cannot be targeted, as her captured traits from different pub- 
lishers cannot be combined. 


3 Federating Advertisement Targeting with Linked Data 


EcoDaLo aims to improve ad targeting by combining visitor data across publish- 
ers. This allows leveling the playing field between local and global publishers: 
local publishers can target more visitors, and their captured visitor traits are of 
higher quality compared to those of global publishers. Typically, integrating all 
publishers’ data results in an additional ad server having access to a large amount 
of data. This provides a global fine-grained view of every individual visitor, and 
allows detailed analysis over all data. However, it also requires publishers to give 
up control over the data they captured (Fig. 2, left). 

'The addressed challenges include cross-publisher targeting without sharing 
all data and providing an extensible and scalable framework to various new 
partners. We chose to keep the data spread across publishers, and let a separate 
neutral party do federated querying on the level of captured visitor traits, using 
unambiguous semantic descriptions, instead of integrating all publisher's individ- 
ual observations. For example, not every observation that Alice visits a football 
page is shared across publishers, only Publisher B's (aggregated) captured trait 
that Alice likes football is taken into account during federated querying. Also the 
aggregated captured trait is not shared with other publishers, it is only taken 
into account by the federated query layer. 
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The combination of federated querying, and only considering the captured 
visitor traits instead of all data, alleviates privacy concerns, improves scaling 
behavior, and exploits existing infrastructure. The disadvantage is that ad tar- 
geting by combining visitor traits is not as fine-grained as integrating all data. 
For example, it is not possible to target visitors that “visited at least three sports 
pages across all publishers in the last 10 days”, as such information is not shared. 

Visitors’ privacy is protected to a certain extent: no fine-grained information 
is shared across consortium partners. Visitors are, to this point still, identifiable 
across publishers, but the captured traits (and links from these traits to unique 
visitors) remain under (exclusive) control of the publishers. The business rules 
of how those traits are captured remain exclusive to the individual publishers. 

The solution scales as less data needs to be federated: a captured trait can be 
an aggregation from a large number of historical observations. Considering only 
the aggregations can reduce the amount of data by multiple orders of magnitude. 

Publishers’ existing trait capture infrastructure is reused, compared to 
installing a large new trait capture infrastructure. The existing infrastructure 
— optimized to aggregate large amounts of (historical) data to capture traits — 
remains unaltered: its output, i.e., the discovered visitor traits, serves as a data 
source for the federated querying. T'his reduces development effort for the con- 
sortium partners, and increases the chances of adoption by more publishers. 


Federated 
PES Query 


Ad server 3 EcoDaLo ad server 


Ad request Ad response Ad request Ad response 


Fig. 2. Current practice (left) vs. EcoDaLo (right). Typically, all captured data is inte- 
grated in a common ad server (center-left), and publishers give up control of their data 
(dotted DBs, top-left). In EcoDaLo, a common ad server only captures visitor identi- 
fication data (small DB, center-right). Publishers require an additional layer enabling 
federated querying (extra DB outlines, top-right). 


In this section, we provide a high-level overview (Sect.3.1) and an example 
(Sect. 3.2), after which we discuss our design considerations (Sect. 3.3). 
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3.1 High-Level Overview: Federated Querying with Common 
Identifier 


Our solution consists of three main components (Fig. 2, right): 


(i) The EcoDaLo ad server — auxiliary to the pre-existing ad servers used by 
each respective publisher — targets and serves ads to visitors across pub- 
lishers (Fig. 2, (3)). This ad server only provides the common identifier; the 
visitor traits remain under the individual publishers' control. 

(ii) Each publisher provides a semantic layer, exposing the captured visitor 
traits mapped to an interoperable unambiguous trait model (Fig. 2, Œ). 

(iii) A federated querying intermediate layer connects the additional ad server 
with the individual publishers (Fig.2, @)). Due to the explicit semantics, 
we provide an interoperable layer, extensible to new partners. 


3.2 Example of Federated Querying with Common Identifier 


Using our solution, Alice can be targeted by combining multiple traits from 
different publishers (Fig. 3). Alice visits a website of Publisher A as a registered 
visitor (Fig. 3, (1)). She is identified as new visitor within EcoDaLo (EcoDaLo id 
E1, Q). Alice then browses some football pages of Publisher B as an unregistered 
user (3). She is recognized as existing visitor within EcoDaLo (4). 

When a new campaign is launched, the trait combinations are queried, fed- 
erated over the different publishers ©). The mapping to a common trait model 
is used to query the individual publisher’s captured traits, e.g., over_35 is found 
mapped from A.over.35, and sports. lover mapped from B_likes_football. 

When Alice then visits à consortium publisher, such as Publisher A, her 
set of captured visitor traits is sent to the EcoDaLo ad server (6). Alice's trait 
set matches with the mapped target set, Alice is targeted by the campaign, 
and a relevant ad is served (7). Her EcoDaLo id E1 makes sure the number 
of times Alice gets served a specific ad is monitored correctly, even when she 
visits Publisher B. During ad targeting, Alice is not uniquely identified, i.e., 
her EcoDaLo id is not used during federated querying, only during ad serving. 
'Thus, the explicit link between Alice and her captured traits is never stored in 
the EcoDaLo ad server. 


3.3 Design Considerations 


Each publisher has its own trait definitions. This influences ad campaign defini- 
tions and visitor targeting. Before defining an ad campaign, a common, unam- 
biguous trait model is needed for the traits targeted by advertisers, those cap- 
tured by publishers, and the relationships between them. For example, Publisher 
A captures three age ranges (“<18”, “18-35”, “>35”), and Publisher B captures 
five age ranges (^«18", *18-25", “25-35”, “35-65”, “>65”): the targeted trait 
“over 35" is mapped differently for Publisher A and Publisher B. 
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Fig. 3. Alice is targeted by combining multiple traits, from different publishers. 


A semantic model further allows description of trait relations, e.g., the rela- 
tionship between Publisher B’s captured “likes football” and the more general 
“sports lover” can be specified. Instead of requiring all consortium partners to 
alter their system and impose usage of a common trait model, publishers map 
their existing captured traits to a common model. This increases flexibility: a 
single captured trait can be mapped to multiple common traits, and a combina- 
tion of captured traits can be mapped to a single common trait. This increases 
the chances of adoption as changes in the publishers’ pre-existing infrastructure 
and required effort are minimized. 

A visitor can thus be targeted by combining the captured traits across pub- 
lishers, and served a relevant ad. However, to monitor how many ads are served 
to how many distinct visitors, a shared identification mechanism is still needed. 
Multiple options were considered to identify visitor across publishers, among 
others, machine learning and browser fingerprinting: 

Machine learning techniques could help identify individual visitors based on 
their combination of traits. However, more detailed data is not available — given 
that no fine-grained observations are shared — and this would require to create 
a training set of visitors and addressing the related emerging privacy concerns. 

Browser fingerprints [18] provide a quasi-unique identification mechanism by 
combining visitors’ browser and hardware traits, e.g., installed plugins, screen 
resolution, etc. The identification is not 100% accurate, and identification is 
limited to visitors using a single browser and device. 

However, these options were dismissed due to the inability to provide 100% 
accurate results. Given the domain, where inaccuracies are already manifold 
(e.g., visitors using multiple devices, sharing the same accounts, etc.), the con- 
sortium decided not to add more inaccuracies. Instead, we use the EcoDaLo ad 
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server as identifying service, which provides and explicitly links common (Eco- 
DaLo) ids to the visitor ids of each consortium partner. 

'The ad server only stores its own generated ids, mapped to the ids of the 
individual publishers. For example, when Alice first visits Publisher A, she is not 
yet identified within EcoDaLo, the ad server creates a new common EcoDaLo id 
E1, and connects this id with A123, Alice's id of Publisher A (Fig.3, (2). When 
Alice later visits Publisher B, given her previously assigned EcoDaLo id E1, the 
ad server is updated and Publisher B's id B456 is added (Fig. 3, ®©). 


4 Deployment 


EcoDaLo's technical considerations include setting up the EcoDaLo ad server 
(Sect.4.1), using a common trait model (Sect.4.2), mapping each publisher's 
traits to that common trait model (Sect. 4.3), federating the traits (Sect. 4.4), 
and exposing the results to the EcoDaLo ad server (Sect. 4.5). The (development) 
effort for partners to integrate with the EcoDaLo set-up is kept low to increase 
potential uptake and growth of the consortium. 


4.1 EcoDaLo Ad Server 


The EcoDaLo ad server: (1) provides common visitor ids across consortium part- 
ners, (ii) serves ads of campaigns set up within the EcoDaLo consortium, and 
(iii) monitors the number of ads served to distinct visitors. As such, established 
pre-existing ad server software can be used to fulfill multiple requirements. We 
employ an ad server that provides identifiers for every visitor of any website 
within the consortium. Each publisher needs to modify its websites, allowing 
access to the EcoDaLo ad server to add these identifiers. 

'The expected effort is reasonable, as the publishers would need to support 
ads served due to campaigns set up in EcoDaLo in any case. Publishers that 
advertise are already required to gather GDPR-compliant visitor's consent for 
ad targeting involving third-parties, i.e. informing the user who will have access 
to which information for which purpose. Thus, no additional effort regarding the 
consent gathering setup is needed compared to existing solutions. 


4.2 Common Trait Model 


We use an interoperable, semantic model to describe the common traits, as it 
enables meaningful federation across publishers. We provide a Simple Knowl- 
edge Organization System (SKOS) taxonomy [20] based on the IAB Technology 
Laboratory's Audience Taxonomy 1.0* as common trait model. The IAB Tech- 
nology Laboratory (IAB Tech Lab) is an international nonprofit consortium that 
helps companies implement global advertising industry technical standards and 
solutions. We only considered IABs audience taxonomy as a possible common 


t https://bit.ly/3fqV Vko, hosted locally at https://bit.ly /2zfhkwe. 
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trait model, other trait models can be used or created instead. This taxonomy 
is available at http://semweb.mmlab.be/ns/iab/at_1-0, mapped from the origi- 
nally published taxonomy to SKOS using YARRRML [6,15], and processed as 
RML rules [7,8]. The mapping rules are available at http: //semweb.mmlab.be/ 
ns/iab/mapping/iab_audience.mapping.yaml and http://semweb.mmlab.be/ns/ 
iab/mapping/iab audience.mapping.rml.ttl. 

'The modeling effort is limited compared to the typical approach where all 
publishers’ data is integrated: only the traits need to be modeled, as opposed to 
all types of publisher observations and descriptions of how observations lead to 
a captured trait. For example, we do not need to model that the set of observa- 
tions “visiting at least three football pages the last 10 days" is used to capture 
the “football lover”-trait. The use of a declarative mapping language allows for 
possibly fine-grained mappings including the use of functions but can also be 
created manually in a hard-coded fashion. In any case, we provide a transparent 
and maintainable process, adaptable for change, as the Audience Taxonomy is 
currently released for public comment. 


4.3 Mapping to the Common Trait Model 


Each publisher is required to provide a mapping of the captured internal traits 
to the common ones. This mapping can be many-to-many, across multiple levels. 
For example, “football lover" is mapped to “Sports—American Football" and 
the more general “Sports”, and “tennis lover" is — next to “Sports—Tennis” — 
also mapped to “Sports”. More granular mappings can be taken into account, 
e.g., distinguishing the levels of interest of a “football lover". 

The Resource Description Framework (RDF) [5] is useful to describe the 
mapping, as it natively allows to unambiguously link concepts in complex rela- 
tionships. For usability reasons, consortium partners — which are non-Semantic 
Web experts — do not need to manually write RDF triples. Instead, they provide 
a mapping of their custom captured traits to the common trait model, by means 
of a CSV file with three columns: the publisher's captured internal trait id and 
label, and the common trait.id from the IAB Audience Taxonomy. 

'This CSV file is then used to generate the RDF dataset mapping each pub- 
lishers’ internal traits to the common trait model. The generation description 
is written in YARRRML [6,15], a representation of RML [7,8] (Listing 1): the 
generation process remains maintainable, whilst consortium partners are not 
bothered with the details of how RDF triples are generated. Every time the 
mapping changes, ie., when a publisher captures new visitor traits, the RDF 
dataset is regenerated and republished. 

We provide a transparent and maintainable generation process, adaptable for 
change, by using RML. The generation description remains user-friendly relying 
on a CSV configuration document: CSV is easily handled using standard office 
suites, and a common export format for many software packages. 
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1 | prefixes: 

2 iab: "http://semweb.mmlab.be/ns/iab/at_1-0#" 

8 pubA: "http://example.com/publisherA#" 

4 ex: "“http://example.com/#" 

5 | mappings: 

6 iabmapping: 

7 sources: ["mapping.csv~csv"] 

8 subject: "pubA:trait_$(id)" 

9 predicateobjects: 
10 - [ex:comesFrom,  "pubA:publisher^iri"] 
11 = [ex:originalld, "$(id)"] 
12 - [rdfs:label, "$(1abel)"] 
13 - [ex:iab at, "iab:$(trait id)^iri"] 


Listing 1. YARRRML file to describe the RDF generation from CSV data. 


1 | PREFIX ex: <http://example.com#> 

2 | PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
3 | CONSTRUCT { 

4 Tiab a ex:IABSegment ; 
5 rdfs:label ?iabLabel š 
6 ex:hasSegment ?s " 
7 ?s ex:hasId ?id $ 
8 rdfs:label ?label $ 
9 ex:from TfromId 

10 | } WHERE ( 

i4 ?s ex:iab_at ?iab ; 
12 ex:originalld 7?id ; 
13 rdfs:label ?label $ 
14 ex:comesFrom X ?fromId 

15 Tiab skos:prefLabel ?iabLabel 

16 | } 


Listing 2. SPARQL CONSTRUCT query for finding all captured traits across all 
publishers. 


4.4 Federated Querying Layer 


Federated querying of cross-publisher traits is enabled using the generated inter- 
operable RDF datasets of each publisher. Each publisher’s mapping dataset is 
generated in HDT format [11,19], and published as a Triple Pattern Fragments 
(TPF) endpoint [29]. The federated query engine Comunica [26] queries over 
the TPF endpoints of each publisher, and over the published SKOS Audience 
Taxonomy. An example of a federated query for all captured traits is shown in 
Listing 2. The traits are found across all publishers (line 11), and returned with 
their preferred label from the published SKOS Audience Taxonomy (line 15). 
Publishing results using TPF endpoints in lower server-side CPU usage - thus 
requiring minimal investment of the consortium partners — and — in combination 
with Comunica — delivers state-of-the-art federated querying performance [29]. 


4.5 Developer-Friendly API 


A JSON(-LD) [25] API is provided that exposes the results of federated SPARQL 
queries, easing integration with the EcoDaLo ad server. The JSON-LD context 
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hides the individual URI prefixes. This API is consumed (daily) by the EcoDaLo 
ad server, to have an updated view of the consortium partners’ captured traits. 

In an initial stage, the complexities of using RDF are hidden from the part- 
ners, which lowers the threshold for new partners to join the consortium: no 
prior Semantic Web knowledge is needed. 


5 Validation 


The consortium collected focus groups to make sure the devised solution is in line 
with the industry’s common practices. All decisions were communicated in face- 
to-face meetings, and feedback was gathered using the think-aloud method [24]. 
We discuss a launched campaign that evaluate the added benefit of EcoDaLo 
and compare EcoDaLo to other approaches based on six identified features. 


5.1 Launched Campaign 


Our devised solution reached Technology Readiness Level (TRL) 5: we imple- 
mented and validated it in a relevant environment within a launched advertise- 
ment campaign in the end of August 2019 in Flanders, Belgium. 

The Belgian university Vrije Universiteit Brussel (VUB)? acts as client in 
the launched campaign and wants to target (potential) students to maximize 
the registrations for the open VUB day on September 7th 2019°. The campaign 
targets (combinations of) both overlapping and complementary captured visitor 
traits of both Roularta Media Group and Pebble Media in different advertising 
formats, such as “half page” or “mobile leaderboard”. Additionally we measured 
the traffic to the website of the open day at VUB using a tracking pixel. 

Our devised solution has been presented to the industrial partners and served 
as technological base for the described campaign. Around 1.84 million impres- 
sions were delivered by Pebble Media and 1.03 million via Roularta Media Group; 
VUB reported that compared to last year 300 extra people were registered for 
the open day. Additionally industry partners using our solution reported insights 
in different renumeration models, i.e. how to split revenue based on provided 
knowledge about visitor traits and advertising format of the impression. 


5.2 Functional Comparison 


To evaluate the added benefit of EcoDaLo, we perform a functional evaluation 
of six features, comparing EcoDaLo (trait federation) to the status quo of a local 
publisher, a global publisher, and an integration approach (Table 1). 


Trait quality. The trait quality of a local publisher is — due to the locality — 
higher compared to those of a global publisher. This high quality is retained 
when integrating the captured data or federating the traits. 


? https: //www.vub.be/en/home. 
6 https: //www.vub.be/events/2019 /infodag-7-september-2019. 
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Table 1. Summary of functional comparison of advertisement approaches with respect 
to a set of identified features. 


Feature Local publisher | Global publisher | (Data) integration | (Trait) federation 
2 | Scale — ++ + 3E 
3| Exclusive (privacy) | ++ — = 3e 
4 | Ease of set-up ++ ++ = + 
5 | Interoperability —-— —-— — ++ 
6 


Maintainability — — = 3 


Scale. The number of visitors that can be targeted, and the number of different 
traits that can be captured, i.e., the scale, increases during integration and 
federation, however, it does not necessarily reach the same numbers as for 
global publishers. 

Exclusive. Captured data is shared with a global publisher or during integra- 
tion: it is no longer exclusive to a single local publisher. During federation, 
only common traits are shared with the EcoDaLo ad server, the visitor's 
privacy with respect to all collected data is considered. 

Ease of set-up. Federation requires only the mappings of aggregated traits 
compared to integration where all observation types must be mapped to a 
common model; which still requires effort but less. 

Interoperability. Integration slightly improves interoperability by using com- 
mon definitions, as compared to the closed environment of local and global 
publishers. However, the Linked Data principles renders the federation app- 
roach entirely interoperable and machine-understandable. 

Maintainability. Attention was put into improving the maintainability of the 
federation approach, specifically, into maintainability of the common trait 
model generation and the trait mapping description. 


6 Related Work 


We describe related work regarding privacy, semantic web and advertisement. 

Online behavioral advertisement (OBA) is controversial: on the one hand, it 
creates more relevant and efficient ads, on the other hand raises privacy concerns 
as it is based on personal data. For a complete overview of the topic we refer 
the reader to the literature review of Boerman et al. [2]. 

'The W3C Data Privacy Vocabularies and Controls community group devel- 
oped a vocabulary to annotate and categorize instances of legally compliant 
personal data handling [23]. This is complementary to our solution as their 
vocabulary describes consent and data processing purposes in EcoDaLo. 

The SPECIAL project proposed a privacy-aware big data architecture 
focused on consent management and compliance verification [17]. It was devel- 
oped in parallel with EcoDaLo. SPECIAL's sticky privacy policies, data use 
constraints attached to data, could be realized within EcoDaLo by also mapping 
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consent-related information from publishers to a common data model, similar to 
visitor trait data, providing the added feature of ex-ante compliance checking. 

Publishers join forces by introducing an integration component that allows 
aggregating all involved publishers’ captured data [21]. This requires consider- 
able development effort, tailored to existing publishers’ data stores and detailed 
privacy-compliance considerations. As such, a federated approach for querying 
captured visitor traits is, to the best of our knowledge, novel for ad targeting. 

Usage of Semantic Web technologies to enable trait federation in the media 
and entertainment domain was not yet investigated. Existing related work 
instead focuses on automatically generating meaningful targeting profiles, by 
(i) classifying content and ads to form one knowledge graph, and (ii) using that 
knowledge graph to improve ad recommendation algorithms: 

The semantic classification is either created manually [28], or content and ads 
are Classified automatically to a common predefined knowledge graph [4]. Choos- 
ing between manual or automatic classification typically introduces a trade-off 
between quality and scalability. When improving the quality of the automatic 
classification, existing Linked Open Data graphs are used to complete the knowl- 
edge graph [13], and the explicit semantics are exploited to provide detailed 
tagging of content and ads [12]. During recommendation, typically, graph dis- 
tance metrics are used as a measure of relatedness [31], an approach applied 
successfully in the academic publishing domain [27]. 

For EcoDaLo, ad targeting profiles are created manually by the advertiser. 
Related work is thus complementary, enabling improvements as future work: 
recommendation methods can be used to suggest inclusion or exclusion of certain 
traits when specifying an ad campaign. 


7 Conclusion 


Advertising is a monetary stimulus for individuals to share their data with pub- 
lishers and advertisers, in exchange for content. Although not the only option, it 
is very common in the media and entertainment domain. Lately, awareness rises 
concerning the trade-off between respecting an individual’s privacy and increas- 
ing advertising revenue. In EcoDaLo, we introduce an interoperable semantic 
layer among local publishers allowing to exploit high-quality visitor traits using 
federated querying, without sharing data among consortium partners. 

We conclude by discussing privacy and ethical considerations, key features 
of our approach as well as outlining future work. 


7.1 Privacy and Ethical Considerations 


The misuse of personal data, especially for discrimination, is unethical and ille- 
gal; transparency and ethical guidelines may address this issue. 

Intransparency regarding the use of data collected via online behavioral 
advertising may be harmful and unethical if consumers are unaware [2]. 
The GDPR addresses transparency with respect to user-awareness about which 
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personal data" is shared with whom and for which purpose by listing obligations 
regarding valid consent obtainment. Recent court rulings applied these regula- 
tions on concrete cases [16] emphasizing on explicit opt-in to give consent. 

For EcoDaLo, users need to be aware which personal data of which EcoDaLo 
publisher is used for the purpose of online advertisement, including awareness 
regarding participating publishers. Users then have to explicitly give consent for 
this purpose, i.e. they explicitly have to opt-in. EcoDaLo assumes that publishers 
and advertisers act with good faith following relevant ethical guidelines [22] 
which goes beyond the presented technical solution. 


7.2 Key Features of Our Approach 


Hiding the complexities of using semantic technologies increases the potential 
uptake by new consortium partners. Consortium partners are not confronted 
with RDF triples or ontologies but, instead, rely on developer-friendly for- 
mats such as CSV and JSON. The federated querying layer and interoperable 
machine-understandable model are made transparent, lowering effort for con- 
sortium partners and increasing chances of enlarging the consortium. Although 
explicit semantics are currently hidden for consortium partners, (future) advan- 
tages are gained, compared to using an ad-hoc integration layer. Unambiguous 
machine-understandable trait definitions increase interoperability, and make it 
easier for new members to join the consortium. Reasoning can be applied to auto- 
matically enrich knowledge graph: implicit links between common traits can be 
discovered. 


7.3 Future Work 


For future work, we investigate in a complementary validation component of 
the federated querying which i.a. can filter results which are too narrow and 
could harm privacy. Also, we look into the influence of using fine-grained traits 
and applying more advanced queries, a.o., taking into account a captured trait's 
confidence level. For example, when the trait “likes sports” is captured with a 
low confidence level by multiple publishers, this can be combined as a single 
“likes sports” trait with higher confidence. 
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* GDPRs definition of personal data also partially overlaps with protected character- 
istics related to discrimination and thus ethics: https:/ /www.equalityhumanrights. 
com/en/equality-act /protected-characteristics. 
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Abstract. The recent deployments of semantic web tools and the expan- 
sion of available linked datasets have given users the opportunity of build- 
ing increasingly complex applications. These emerging use cases often 
require queries containing mathematical formulas such as euclidean dis- 
tances or unit conversions. Currently, the latest SPARQL standard (ver- 
sion 1.1) only embeds basic math operators. Thus, to address this short- 
coming, some popular SPARQL evaluators provide built-in tools to cover 
specific needs; however, such tools are not standard yet. To offer users 
a more generic solution, we propose and share MINDS, a translator of 
mathematical expressions into SPARQL-compliant bindings which can be 
understood by any evaluator. MINDS thereby facilitates the query design 
whenever mathematical computations are needed in a SPARQL query. 


1 Introduction 


During the past two decades, semantic web technologies for the web have been 
developed and it is now possible to produce, share, analyze and interlink large 
knowledge graphs (sometimes containing billions of facts) structured using the 
RDF w3c standard [12]. Additionally, the W3C has standardized SPARQL [14], 
the de facto query language dedicated to RDF which has been more recently 
improved to add new features, see e.g. [19] for its current version. Furthermore, 
several projects have been created where SPARQL public endpoints are openly 
available to access data such as DBpedia [9] or YAGO [16]. As a consequence, 
to leverage these resources the Semantic Web community has been developing 
more and more complex use cases involving several endpoints which are then 
© The Author(s) 2020 


E. Blomqvist et al. (Eds.): SEMANTiCS 2020, LNCS 12378, pp. 104-117, 2020. 
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queried together using federated SPARQL queries to build or extract knowledge 
from combinations of multiple endpoints. In addition, these use cases some- 
times require the computation of mathematical formulas which combine values 
according to specific patterns, to either filter or return the results. However, in 
the current version of the standard! , only the four basic mathematical operators 
are available (+, —, *, /) and some basic predefined functions, such as CEIL 
or FLOOR. To address this lack in the standard, some popular evaluators allow 
extensions to the SPARQL language to cover popular mathematical functions 
(e.g. trigonometric operations). Nonetheless, this results in queries especially 
built to be executed in a specific system and which therefore cannot be shared 
among users. 

To gain in interoperability, we propose and share MINDS: a translator to 
embed Mathematical expressions INsiDe SPARQL queries. Our implementation 
is openly available under the terms of the Apache License version 2.0 from: 


https:/ /github.com/SmartDataAnalytics/minds 


MINDS translates the given mathematical expressions into a list of SPARQL- 
compliant bindings i.e. BIND((...)AS ?var). This approach allows thereby the 
obtained SPARQL queries to be executed by any kind of evaluator while facili- 
tating the task of query design. 

'The rest of this article is structured as follows. First, we review the related 
work in Sect. 2 and next, we motivate our approach with an example requiring 
mathematical formulas in Sect.3. Then, we describe MINDS in Sect. 4, before 
presenting in Sect. 5 some accuracy results about our methods and some com- 
parisons against existing SPARQL evaluators. In Sect. 6, we present various use 
cases implying the use of MINDS. Finally we conclude in Sect. 7. 


2 Related Work 


In this section, we provide an overview of the related work regarding mathe- 
matical formulas inside SPARQL queries. Due to the SPARQL standard lack- 
ing the specification of something essential as basic math functions’, different 
approaches have emerged to serve this need. 

In fact, some SPARQL evaluators do not give the possibility of computing 
mathematical functions inside queries at all. This is for instance the case with 
Astore [7], RDF3X [13] or SPARQLGX [5] which are nonetheless popular evalu- 
ators from the literature renown for their performance. However, arguably, the 
research focus of these systems was on optimization of joins and indexes and less 
on feature completeness. 

Currently, all practical relevant SPARQL evaluators offer the opportunity 
of computing mathematical functions inside the BIND elements and projections. 


1 https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#expressions. 

? Currently, the SPARQL 1.2 Community Group which aims to advance SPARQL 
functionalities, is describing several mathematical operators that could be added in 
the next iteration of the standard. https://github.com/w3c/sparql-12/. 


106 D. Graux et al. 


While the SPARQL standard defines the built-in functions as part of the syntax’, 
the widely adopted approach by evaluator developers is to take advantage of the 
Function Call rule, which allows arbitrary IRIs to be used as function names. 
Hence, function extensions typically require no changes to the SPARQL syntax. 
However, the lack of standardization implies two drawbacks: 


— Firstly, the namespaces, local names and signatures of functions may vary 
between SPARQL engines, which makes it tedious —if not prohibitive- to 
exchange backends. 

— Secondly, the means of computation of a function and therefore the results 
may differ between evaluators. 


All popular SPARQL evaluators —often used to serve public endpoints- such 
as Virtuoso [4], Jena-Fuseki [8], GraphDB* and Stardog? feature mathemati- 
cal functions, yet, using different IRIs. For instance, Virtuoso uses the bif: 
namespace, whereas Stardog reuses the XPath function namespace?. Using such 
an approach of naming differently similar function/operator’ implies a loss of 
interoperability, especially, it make the design of federated SPARQL queries far 
more complex. Finally, some evaluators implement GeoSPARQL [2] giving then 
access to spatial functions for use in SPARQL queries such as finding a distance 
or computing a convex hull. 

Compared with existing evaluators which provide sometimes built-in math- 
ematical functions, MINDS chooses to use approximations when necessary in 
order to remain fully compliant with the SPARQL language. 


3 Motivating Example 


To have a better understanding of when mathematics may be needed in SPARQL 
queries, we consider a use case based on the geographical position of fossils found. 
Having a dataset recording the found fossils, we want to list the fossils: 


a. found in the last ten years; 
b. located 100 km around a specific position; 
c. older than 1 000 years. 


For clarity reasons, we will consider a simplified dataset recording Cartesian 
positions, a ! ^C-ratio and the discovery year. Each fossil is then represented by 
an identifier using the following structure: 


3 https://www.w3.org/ TR/sparql11-query/ grammar. 

^ https: //ontotext.com/products/graphdb/. 

5 https://www.stardog.com/. 

6 https://www.w3.org/2005/xpath-functions/math#. 

T Implementations for built-in STDEV in Virtuoso, Fuseki, Stardog, Sesame: https:// 
gist.github.com/albertmeronyo/c6ab285d0b73b05392e2f0b8a5bbea82. 
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:fossilld  :type :fossil . 
:fossilId :abscissa "XXX" 
:fossilId :ordinate  "YYY" 
:fossilId  :foundIn "year" 
:fossilId :ci4rate "ratio" 


As a consequence, to list all the fossils, one might run this SPARQL query: 
SELECT ?f WHERE { ?f :type :fossil}. In the rest of this Section, we will 
refine step by step this query to add the restrictions specified above, emulating 
the process of a query designer. 


a — Found in the last ten years. This constraint implies the filtering of the 
records according to the year of their discovery. Considering that the current 
year is 2020, we will keep only fossils found after 2010 and we can ask: 


SELECT ?f WHERE 1 
?f :type :fossil . 
"Tf :foundIn ?Y ; 
FILTER( (2020-?Y) <= 10 ) } 


In this particular case, only a simple FILTER (involving a simple operation) is 
required to refine the join. 


b — 100 km around a position. Then, we want to return fossils found around 
a specific position whose Cartesian coordinates are (Px,Py). To do so, we have 
to compute Euclidean distances between this position and the fossils using the 
classic formula: d = y Aa? + Ay?. However, according to the standard, there 
is no square operator and no square-root. Obviously, we can escape from these 
issues easily by comparing d? instead of d. Our SPARQL query thus becomes: 


SELECT ?f WHERE { 
?f :type :fossil . 
"Tf :foundIn ?Y 
?f :abscissa ?x 
?f :ordinate ?y $ 
FILTER( (2020-?Y) <= 10 ) 
FILTER( ( (?x-Px)*(?x-Px) + (?y-Py)*(?y-Py) ) <= 100*100 ) } 


As one can see, the FILTER condition is getting longer —increasing the probability 
of errors and typos for example- and in this example we only deal with simplified 
data (for instance no unit conversions are needed). 


c — Older than 1 000 years. The last condition only retains fossils which are 
older than 1 000 years. However, the considered dataset does not share ages but 
instead !4C-ratios r of fossils which can be used using radiocarbon dating — 
considering the !^C half-life t,/5 i.e 5 700 years- to find the age t(r) according 


to the following formula: 
In(r) 
WS (40) Ap 
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This expression involves the natural logarithm which is, however, not part of 
the standard. Therefore, to compute this expression, the query designer has to 
approximate the logarithm, using for example a decomposition in series: 


+oo 


1 ST 2k+1 
my) 22^ spa lui 
Vy €]0,+00[, In(y) xo om 


The FILTER can now by written: FILTER(5700*?L0G/ (70.693) «21000) where 
?LOG is a variable embedding the logarithm approximation whose result quality 
depends on the number of terms used in the decomposition. Considering only 
the first three terms (k € [0,2]) and the !4C-ratio ?rate of fossils, we have: 


BIND(( (?rate-1)/(?rate+1) ) AS ?z ) 
BIND(( ?z ) AS ?tO ) 

BIND(( (1/3)*(?z*?z*?z) ) AS ?t1 ) 
BIND(( (1/5)*(?z*?z*?Tz*?z*?z)) AS ?t2 ) 
BIND(( 2*(?tO + ?t1 + ?t2) ) AS ?LOG ) 
FILTER (5700*?L0G/ (-0.693) <=1000) 


As a consequence, it appears that building this simple approximation for its 
first three terms already leads to a rather complicated query. 


Furthermore. As stated previously, the example has been simplified for the 
sake of clarity. Firstly, the series approximation should indeed involve more terms 
i.e. at least 5 (see Sect. 5 for more details about the approximation preciseness). 
Secondly, when dealing with geographical data on Earth, latitude and longi- 
tude coordinates are actually preferred to Cartesian ones. Thus, considering two 
points Pi (latı, lon) and P2(lat2, long), the distance d should be calculated using 
the Haversine formula to calculate the great-circle distance: 


a — sin? (22) + cos 1. cos p2. sin? (44) y latitude in rad: s 
c=2. atan? (ya, /1—a) ;and4 A longitude in rad: “SRo 
d=R.c R the Earth radius: 6371 km 


Thereby, to compute d with this formula, several non-standard functions are 
required: 7 trigonometric ones and 2 square-roots. If this very query were to be 
evaluated, the designer would have to write herself the multiple decompositions 
in series which would be tedious and a possible source of errors. In the next 
Section, we introduce MINDS: our solution to help query designers when dealing 
with mathematical expressions. 


4 MINDS: From a Math Formula to SPARQL Bindings 


To tackle this gap in the SPARQL standard, and to help query designers in their 
tasks, we developed a software called MINDS. In a nutshell, it allows users to 
input a mathematical expression and obtain -only using standard operators 
and keywords- the exact corresponding translation, or an approximation if a 
decomposition in series has to be involved. 
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Expression = Expression BinOp Expression 
Expression ** [:digit:] 

LeftOp Expression 

BiParamOp Expression , Expression 
( Expression ) 

? [:Alphanum:] 

[:digit:]. [:digit:] 

BinOp =+]|-|*|/ 

LeftOp = 1n | exp | sqrt 

sin | cos | tan 


BiParamOp = atan2 


Fig. 1. Grammar of the expressions understood by MINDS. 


Practically, we developed MINDS as an external software which can be run 
when designing queries. It is written in Python [18] and its core currently rep- 
resents about 500 lines of code. Technically, the given formula is parsed using 
a dedicated implementation of the popular Lex and Yacc tools [11] for Python 
named PLY®. Then, once the formula is split into tokens, the translating rules 
are applied recursively to generate the final result. For instance, considering 
again the example of Sect. 3, the “2020-?7Y” expression will be translated into: 


#math2sparql > 2020-?Y 
BIND ( ( FLOOR((2020-xsd:double(?Y))*100)/100 ) AS ?result ) 


Compared to the solution presented in Sect. 3, the actual binding is already more 
complicated: first, since it specifies that ?Y should be considered as a double; and 
second, since it truncates the result to keep only two digits of precision with the 
FLOOR keyword of the standard. Actually, this precision parameter can be set by 
the user in MINDS, for instance to 5: 


Zmath2sparql > precision = 5 
#math2sparql > 2020-?Y 
BIND ( ( FLOOR((2020-xsd:double(?Y))*100000)/100000 ) AS ?result ) 


Therefore, MINDS is still relevant to handle even simple expressions that are 
cumbersome to express in SPARQL such as the d? (i.e. a squared Euclidean 
distance) of the previous Section: 


#math2sparql > (?m-?Pz)**2 + (?y-?Py)**2 

BIND ( ( FLOOR(( 
(1*(xsd:double(?x)-xsd:double(?Px))*(xsd:double(?x)-xsd:double(?Px)))- 
(1*(xsd:double(7?y)-xsd:double(?Py))*(xsd:double(?y)-xsd:double(?Py))) 

)*100)/100 ) AS ?result ) 


8 Python Lex-Yacc repository: https:/ /github.com/dabeaz/ply. 
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+00 at Too " qe 
expr = 1 "T sing = 1 (-1) AEN 
a = (2k + 1)! 
--oo 2k+1 Too 2k 
1 r—1 kc 
Ing — 2. = > -1 
MS M aen exi) iud (-U Gy 
k=0 k=0 
-Foo -Foo 2t+1 k : 
1 1 $—1 sin x 
= t = 
Ve 2 (x 2t 1 Gs ADT cos x 
k=0 t=0 
y Too qe 
atan2(y, x) = 2. arctan ——— = arctan t = J (-1)* 


/ £? + y? +2 = 2k+1 
Fig. 2. Series currently used by MINDS. 


We furthermore describe in Fig.1 the grammar which is understood by 
MINDS. In particular, our translator is able to deal with the four basic operators 
of SPARQL (i.e. + - * /) extended with the power operator (** in MINDS) 
while respecting conventional priorities. Moreover, our solution also provides 
several translation rules to deal with mathematical functions e.g. trigonometric 
functions and even with functions of multiple variables e.g. atan2. Nonetheless, 
these additional functions are not part of the standard and have to be expressed 
only using allowed SPARQL operators: MINDS is then able to compute approx- 
imations to translate into bindings these functions. Indeed, it uses when neces- 
sary a series decomposition such as the ones listed in Fig. 2 and technically a new 
binding is generated for each series so that the query evaluator might be able to 
store the sub-result. For instance, considering x? + exp(y + 3z) which involved 
the computation of the exponential of a linear expression, MINDS returns: 


#math2sparql > ?X**2 + exp (?Y + 3 * ?Z) 

BIND ((0*(1)/1.0 #1 
+(1*(xsd:double(?Y)+3*xsd:double(?Z)))/1.0 #y+3z 
+(1* (xsd: double (?Y)+3*xsd: double (?Z)) 

*(xsd:double(?Y)*3*xsd:double(?2)))/2.0 # (32 
*(i*(xsd:double(?Y)-*3*xsd:double(?Z)) 
*(xsd:double(?Y)-*3*xsd:double(?Z)) 
*(xsd:double(?Y)-3*xsd:double(?2)))/6.0 # C2 
+(1*(xsd:double(?Y)+3*xsd:double(?Z)) 
*(xsd:double(?Y)+3*xsd:double(?Z)) 
*(xsd:double(?Y)+3*xsd:double(?Z)) 
*(xsd:double(?Y)+3*xsd:double(?Z)))/24.0 # (yt3z)" 

JAS ?subi) 

BIND ( ( FLOOR(( 

(1*xsd:double(?X)*xsd:double(?X)) +?sub1 # r? + sub, 
)*100)/100 ) AS ?result ) 


As expected, MINDS automatically converts the exponential part into an 
approximation using the classic series of the exponential (see Fig.2 for more 
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Fig. 3. Natural logarithm and its approximations. 
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Fig. 4. Cosine and its approximations. 


details); in this case only the first five terms of the series where considered. As 
it will be described in Sect. 5, the more terms are involved the more precise the 
results will be; nonetheless, it is also important to mention that MINDS allows 
query designers to choose as a parameter this number of terms. Moreover, it is 
able to understand any kind of combination using its recognized keywords and 
it generates recursively the sub-bindings when necessary. 


5 Precision Results 


First of all, we want to pinpoint that the bindings generated by MINDS offer the 
same orders of magnitude as the tested built-in functions in terms of execution 
times. Indeed, the evaluation of a BIND expression or the call to an internal 
method are both executed in sub-second times; for more details, we refer the 
reader to the end of this Section, where external links of running queries on 
various SPARQL endpoints are available. 

Moreover, since MINDS uses approximations based on series for some math- 
ematical functions (see Fig. 2 for details), we further describe in this Section the 
accuracy of such a method before comparing MINDS with built-in functions of 
popular SPARQL endpoints. 
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Fig. 5. Approximation drifts (first seven terms) from the theoretical functions. 


Accuracy. First, we should remind that the number of terms used in the series 
has an impact on the quality of the approximation. Here, we review the approx- 
imation of the natural logarithm In in Fig.3, and of the cosine cos in Fig. 4. 
In both cases, we draw the exact function as a reference in red, together with 
several approximations: in blue only the first term of the series, in orange the 
first three ones and in purple the first seven ones. Thereby, it is evident that by 
considering only the first seven terms already provides more than 95% of accu- 
racy for the logarithm in the interval [1,20] and the approximation for In(100) 
is still 80% accurate. However, with trigonometric functions (see e.g. the cosine 
in Fig. 4), more terms are required. Nonetheless, to tackle this problem, MINDS 
takes advantage of the periodicity of these functions and actually: 


1. adds an additional binding to represent an approximated value of 27 i.e. 
BIND((6.28318530718) AS ?2P); 

2. replaces the expression ?f inside the sin or the cos function with the remainder 
of the division of ?f by 27 i.e. (?£ - ?2P * FLOOR(?f/?2P)). 


'This method allows MINDS to stay within an interval in which the accuracy 
remains above 8096 with the first seven terms. More generally, in Fig.5, we 
present the drifts between mathematical functions and their respective approxi- 
mations using the first seven terms of their series. This representation allows the 
query designers to determine the intervals where the proposed approximations 
of MINDS still have an accuracy above a chosen threshold, letting them decide 
the appropriate number of terms in the series to be generated. 


Comparison with Built-in Functions. Since mathematical functions are not 
part of the SPARQL standard [19], most of the popular systems providing end- 
points have implemented their own versions of some functions (see Sect.2 for 
more details about these systems). In this study, we also present comparisons 
between MINDS approximations and the built-in functions from some of these 
systems, namely: Virtuoso? [4], GraphDB!? and JenaFuseki!! [8]. 


? https: //virtuoso.openlinksw.com/. 
10 https: //ontotext.com/products/graphdb/. 
11 https: //jena.apache.org/documentation/fuseki2/. 
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SELECT * WHERE { 
MALUES 7V { 0.1 0.20.5 1234567 8 9 10 20 50 100 } 
BIND (( (?V-1)/(?V--1.) ) AS ?ratio) 
BIND (( bif:log(?V) ) AS ?BuiltInLog ) 
BIND (( 2*?ratio ) AS ?0neTerm ) 
BIND (( 2*( ?ratio -- 
(1/3.) *?ratio*?ratio*?ratio-- 
(1/5.)*?ratio*?ratio*?ratio*?ratio*?ratio ) 
) as ?ThreeTerms ) 
BIND (( 2*( ?ratio -- 
(1/3.)*?ratio*?ratio*?ratio-- 
(1/5.)*?ratio*?ratio*?ratio*?ratio*?ratio-- 
(1/7.)*?ratio*?ratio*?7ratio*?ratio*?ratio*?ratio*?ratio-- 
(1/9.)*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio-- 
(1/11.)*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio* 
Tratio*?ratio-- 
(1/13.)*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio*?ratio* 
Tratio*?ratio*?ratio*?ratio ) 
) as ?SevenTerms ) 


Fig. 6. Query involving Virtuoso's built-in |n and approximations for different values. 


In Table 1, we present the raw results of à SPARQL query which computes 
on Virtuoso for several values (?V): the built-in natural logarithm (?BuilInLog) 
using the bif: prefix and three bindings generated by MINDS varying the num- 
ber of terms involved i.e. one, three and seven (see Fig.6). We observe that 
the accuracy measured corresponds to the one expected theoretically (as drawn 
e.g. in Fig.3 and 4). This observation implies that the Virtuoso engine executes 
exactly the operations listed in the bindings (without rounding nor truncating). 

More generally, since the built-in functions are specific addons provided by 
the systems, the set of available mathematical functions may vary across them: 
for instance, GraphDB provides very specific functions such as “hypot(a, y)" 
which returns yx? +y? or *"IEEEremainder(z,y)" which is the remainder 
operation on two arguments as prescribed by the IEEE 754 standard. Fur- 
thermore, currently (without MINDS), the query designers have to tune their 
SPARQL queries for each evaluation engine. For example, we list here various 
syntaxes to evaluate a logarithm: 


# Virtuoso. 
SELECT * WHERE { BIND ((bif:log(1234))AS ?log) } 


# GraphDB. 
PREFIX f: «http://www.ontotext.com/spargl/functions/» 
SELECT * WHERE { BIND ((f:10g(1234))AS ?log) } 


# Fuseki2. 
PREFIX math: <http://www.w3.org/2005/xpath-functions/math#> 
SELECT * WHERE { BIND ((math:1og(1234))AS ?log) } 
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Table 1. Virtuoso’s built-in natural logarithm vs some MINDS bindings. 


?V |?BuiltInLog | 7OneTerm ?ThreeTerms ?SevenTerms 

0.1 | —2.30259 —1.636363636363636 | —2.148161762423083 | —2.28612550677627 
0.2 | —1.60944 —1.333333333333333 | —1.583539094650206 | —1.608934294900188 
0.5 | —0.693147 | —0.666666666666667 | —0.693004115226337 | —0.693147170256012 
1 0.0 0 0 0 

2 0.693147 0.666666666666667 | 0.693004115226337  |0.693147170256012 

3 1.09861 1 1.095833333333333  |1.098607062425422 

4 1.38629 1.2 1.375104 1.386202224193573 

5 1.60944 1.333333333333333 | 1.583539094650206 | 1.608934294900188 

6 1.79176 1.428571428571429 | 1.745899525991154 | 1.790187408711124 
7 1.94591 1.5 1.876171875 1.942329693525345 

8 2.07944 1.555555555555556 | 1.983078460261816 | 2.072740626022152 

9 2.19722 1.6 2.072405333333333 | 2.186225968329208 
10 | 2.30259 1.636363636363636 | 2.148161762423083 | 2.28612550677627 

20 | 2.99573 1.80952380952381 2.545790028209391 | 2.880218635963087 
50 | 3.91202 1.92156862745098 2.840323022038755 | 3.419833927257202 
100 | 4.60517 1.96039603960396 2.950171566927436 | 3.64870669515376 


Notice that it is possible to directly run these examples —based on the natural 
logarithm!?— on several systems, considering that these systems are used to 
provide public SPARQL endpoints by a number of popular services, some of 
which available at the following links: 


— Virtuoso on the DBpedia endpoint ©; 
— GraphDB on the FactForge endpoint c; 
— Fuseki2 on the ZBW Labs endpoint c. 


The three above hypertext links provides visualizations of the SPARQL queries 
and automatically compute and display the results. They provide similar results 
as the ones already presented in Table 1. 


6 Use Cases 


MINDSaims to be a generic tool which can be integrated into existing system for 
SPARQL parsing or mapping to different transformations. To this aim we are 
developing a number of use case implementations on different tools and systems. 
We group such use cases into two different categories: 


Integration 

SPARQL-to-SQL rewriter — Sparqlify. Sparqlify'® is a SPARQL-SQL rewriter 
that enables the definition of RDF views on relational databases and their 
12 More examples online from https://smartdataanalytics.github.io/minds/ where 


some other built-in functions are reviewed with other sets of values. 
13 https: //github.com/SmartDataAnalytics/Sparqlify. 
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querying using SPARQL [15]. MINDSis being used for mathematical transforma- 
tions into SPARQL bindings embedded into Sparqlify. Users will write SPARQL 
queries following the instructions represented by MINDSand then Sparqlify will 
take over the query rewriter into SQL syntax. 


Semantic Analytics Stack - SANSA. SANSA [10] is an open source! data flow 
processing engine for performing distributed computation over large-scale RDF 
datasets. It provides data distribution, communication, and fault tolerance for 
manipulating massive RDF graphs and applying machine learning algorithms on 
the data at scale. SANSA uses Sparglify as an underlying infrastructure for the 
integration of existing SPARQL-to-SQL rewriting tools. By doing so, it enables 
mathematical transformations as well via MINDSas a support add-on. 


Usability 

Blockchain — Alethio Use Case. Alethio!? is modeling an Ethereum analytics 
platform that endeavors to provide transparency over the transaction pool of the 
Ethereum network. Their 5 billion triple dataset contains large scale blockchain 
transaction data modelled as RDF according to the structure of the Ethereum 
ontology!?. Alethio has been using SANSA as a scalable processing engine for 
their large-scale data processing tasks, such as querying the data in real time 
via SPARQL and performing related analytics [6,17]. MINDS was used through 
SANSA integration and served as an easy-to-use mathematical function evalu- 
ator, such as time-series of the latest exchange values, average transaction size 
or even filtering some chains considering geometrical-mean of some included 
parameters. 


Geospatial Data — SLIPO. SLIPO! was an EU Horizon2020 project which aimed 
at developing linked data technologies for the scalable and quality-assured inte- 
gration of Big Point of Interest (POI) datasets [1]. SLIPO used SANSA as a 
scalable querying engine to deal with their large-scale POIs data [3]. In partic- 
ular, SLIPO aimed at discovering areas of interests using POI datasets which 
implies, for instance, searching road segments where amenities with some com- 
mon parameters are located. To do so, MINDS is being used there to filter POIs 
which are inside a convex hull. 


7 Conclusion 


In this article we introduced MINDS", a translator of mathematical expressions 
into SPARQL bindings. MINDS is also open source and shared on the Github 
platform which, in addition, provides us with the needed tools to manage an 


https: //github.com/SANSA-Stack. 

15 https://aleth.io/. 

16 https://github.com/ConsenSys/EthOn. 

17 http:/ /www.slipo.eu/. 

18 MINDS web-page: https://smartdataanalytics.github.io/minds/ which offers more 
information and details about the software such as for instance tutorials, running 
examples inside other SPARQL evaluators, accuracy charts. 
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open-source software i.e. a bug tracker, a way to integrate external contributions 
or also a release generator. We do hope this tool will help query designers in 
their tasks by providing in an instant the SPARQL compliant translation of 
complicated mathematical expressions, while giving them the ability of adjusting 
parameters in approximations. 
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Abstract. Semantic data integration from heterogeneous, distributed 
data silos enables Digital Humanities research and application develop- 
ment employing a larger, mutually enriched and interlinked knowledge 
graph. However, data integration is challenging, involving aligning the 
data models and reconciling the concepts and named entities, such as 
persons and places. T'his paper presents a record linkage process to rec- 
oncile person references in different military historical person registers 
with structured metadata. The information about persons is aggregated 
into a single knowledge graph. The process was applied to reconcile three 
person registers of the popular semantic portal “WarSampo — Finnish 
World War 2 on the Semantic Web". The registers contain detailed infor- 
mation about some 100 000 people and are individually maintained by 
domain experts. Thus, the integration process needs to be automatic and 
adaptable to changes in the registers. An evaluation of the record link- 
age results is promising and provides some insight into military person 
register reconciliation in general. 


1 Introduction 


A way to enhance our understanding about history is to integrate data from 
complementary information sources in an interoperable way. In record linkage 
(RL) [2,6,13], the goal is to find matching structured data records between het- 
erogeneous databases. A typical application scenario is matching person records 
in different person registers, which contain structured data about some same per- 
sons, but are expressed using different metadata schemas and notations. Using 
RL, richer global descriptions of persons can be created based on local datasets. 

This paper concerns the problem of entity reconciliation and RL of persons 
in military historical person registers. As a case study, three complementary 
datasets about some 100 000 Finnish Second World War soldiers in WarSampo 
[7,10] are considered. A probabilistic record linkage [6] solution for linking per- 
son records is presented, as well as promising evaluation results. The key idea is 


© The Author(s) 2020 
E. Blomqvist et al. (Eds.): SEMANTIiCS 2020, LNCS 12378, pp. 118-126, 2020. 
https://doi.org/10.1007/978-3-030-59833-4 8 


Integrating Historical Person Registers as Linked Open Data 119 


to assign weights to various comparisons of metadata fields between person reg- 
isters. The weights can tell us what information is important for disambiguating 
person records in the military history context. 

After the matches between registers are generated, information is aggregated 
into the actor ontology, which contains the identities and enriched metadata of 
each person. Integrating the person registers into a single knowledge graph (KG) 
facilitates biographical and prosopographical research [9]. 

The WarSampo KG is published as open data! and is part of the interna- 
tional Linked Open Data Cloud. The WarSampo portal? [7] demonstrates the 
usefulness of the resulting KG integrated from various sources. WarSampo uses 
Linked Data and the event-based CIDOC Conceptual Reference Model (CRM)? 
together as a basis for harmonizing various datasets about Finland in the Second 
World War. The portal provides nine customized interactive “perspectives” on 
the data: (war) Events, Persons, Army Units, Places, Magazine Articles, Casu- 
alties, Photographs, War Cemeteries, and Prisoners of War. Since its opening 
in 2015, the WarSampo portal has been used by more than 710 000 end users, 
corresponding to more than 1096 of the population in Finland. 


Related Work. Overviews of the RL field are presented in [6,13]. Antolie 
et al. [1] present a case study of integrating Canadian World War I data from 
three sources: one of soldiers, one of casualties, and a census dataset, using a 
series of handcrafted deterministic RL processes. Research use of the resulting 
longitudinal data is demonstrated. Cunningham [3] presents integrating a World 
War I veteran military service record with a census database using a determin- 
istic RL process and provides findings of quantitative analysis of the data for 
historical research. 

The Historical Population Register (HPR) of Norway is pursuing to cover 
the country's whole population in 1800-1964 combining information with RL 
from church records and censuses [12]. The Links project^ has similar goals in 
the Netherlands aiming to reconstruct all nineteenth and early twentieth century 
families in the Netherlands based on all civil certificates from this period. 


2 Data: WarSampo Person Registers 


In WarSampo, information about a single person can be found in multiple per- 
son registers, each bringing in some new information about the person. The 
information found from multiple sources can be combined to create a more com- 
plete biography of the person. However, it is challenging to reliably say whether 
two similar looking person records in different registers refer to the same actual 
person as they contain no common fields with shared unique values. 


1 https://doi.org/10.5281/zenodo.3431121. 

? http:/ /sotasampo.fi/en/. 

3 http:/ /cidoc-crm.org. 

^ Of. the project homepage https://iisg.amsterdam/en/hsn/projects/links and 
research papers at https://iisg.amsterdam/en/hsn/projects/links/links- publica 
tions. 
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The military rank and military unit of a soldier are prone to change in time 
due to promotions or even demotions. There can be different spellings of a name, 
middle names can be missing, and in Finland many originally foreign surnames 
were translated into Finnish in the early 20th century. In practise, the same full 
name can refer to different persons, and different names can refer to the same 
person. There are currently three different person registers in WarSampo: 


1. Initial Actor Ontology. The ontology containing 5600 people, and 
also military units, has been created from various data sources which provide 
varying levels of detail [11]. For most of the people there is rich biographical 
metadata, e.g. a person’s full name, the dates and places of birth and death, 
occupation, and dates of promotions during the military career. However, in 
some cases the level of detail is not sufficient for disambiguation, e.g., only a 
surname and military rank may be known. 

2. Register of Military Death in the Finnish Wars 1939-1945. The 
register contains 94 700 death records (DR) [8], depicting the status of the 
person at the time of his/her death. The spreadsheet source data contains 
detailed information about the known Finnish persons who perished in WW2. 
There are 32 columns of structured information about each person, with each 
cell having a single literal value. 

3. Register of the Prisoners of War in Soviet Union 1939-1945. The 
register contains 4200 prisoner records (PR) [9], depicting the status of per- 
sons at the time when they were captured. It was published in WarSampo on 
November 2019. The spreadsheet source data contains mostly very detailed 
information about each known Finnish prisoner of war. The spreadsheet con- 
tains 45 columns of information about each person, gathered from, e.g., var- 
ious archives. Often a single cell contains multiple values corresponding to 
information in different sources, following a pre-defined cell formatting. Most 
of the cells contain well-formed literal values, like the municipality of birth, 
military rank, and date of returning from captivity. 


3 Method: Linking Person Records 


The WarSampo KG is built from source datasets using a repeatable data trans- 
formation pipeline [10]. In this approach, the domain experts maintain the pri- 
mary data in the original native format, i.e., typically spreadsheets. When a 
source dataset is updated, the pipeline can be used to easily recreate the whole 
KG with the updated data. 

'The pipeline transforms the source spreadsheets of DRs and PRs into RDF, 
mapping the columns to RDF properties, with possibly multiple values per prop- 
erty. Automatic probabilistic entity linking processes then link the records to 
the WarSampo domain ontologies of military ranks, units, occupations, people, 
and places. This semantic reconciliation improves the interoperability [4] of the 
person registers. If the related domain ontologies are updated, the whole integra- 
tion process can be redone to account for the changes in the probabilistic entity 
linking. 
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The person record linkage is performed after linking the metadata values to 
domain ontologies. This is challenging because of heterogeneity of the metadata 
schemas, ambiguous metadata annotations, temporal changes, and errors in the 
data. Approximate similarity matches of metadata fields is often useful when 
working with noisy historical person records [1]. 

The two record linkage scenarios that are needed to tackle for integrating 
data from all three person registers are: 


RL1. DRs (94 700 person records) linked with the initial actor ontology (5600 
persons) 

RL2. PRs (4200 person records) linked with the actor ontology enriched with 
the DRs (99 667 persons) 


The first developed solution, applied in both scenarios, was a deterministic 
(or rule-based) RL, in which all person pairs were compared with each other, and 
scored based on a pre-defined handcrafted formula. This was manually evaluated 
to provide at least satisfactory results (precision estimated to be at least 0.9), 
but as the datasets were being updated and the ontologies evolving, manually 
maintaining the scoring formula was decided to be not sustainable. 

The second solution is to use probabilistic RL [6], with a logistic regression- 
based machine learning implementation employing the Dedupe Python library [5]. 
Results from the previous solution are used as training data, consisting of 216 
matches for RL1 and 1234 matches for RL2. Of these, the ones close to the match 
acceptance threshold were manually validated to be correct. Person instances or 
person records with only 3 or less metadata fields for the RL are ignored as too 
ambiguous in the linking process. The RL solution® is open-source, and is used in 
the transformation processes of the DRs® and the PRs’. A run of the probabilistic 
RL process completes within a few hours in both of the scenarios on an average 
desktop computer. 

The scoring of possible pairs between the PRs and the persons already inte- 
grated to WarSampo, i.e., initial actor ontology and DRs, are performed using 
the comparisons of properties shown in Table 1. The weighted sum of the individ- 
ual comparisons is used as a confidence that a given pair of records is a match, 
i.e., that it refers to the same real world person. If the weighted sum is above a 
given, manually fine-tuned threshold, the records are considered a match. The 
comparisons of type string use hyper-parameter optimization to find the best 
performing string comparison for the values, e.g., Jaro- Winkler. The intersection 
comparisons compare the one or more URI values of both records to see if there 
is a matching URI or not. The date comparisons measure the distance of two 
dates based on CIDOC CRM time-spans, which have separate earliest and latest 
dates. The numerical comparison measures the distance of numerical values. 

To address temporal changes in a person's military rank and the observed 
variance in the use of different private level ranks, a comparison based on the 


? https:/ /github.com/SemanticComputing/warsa-linkers. 
$ https: //github.com/SemanticComputing/Casualty-linking. 
T https: //github.com/SemanticComputing/WarPrisoners. 
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Table 1. Used metadata comparisons between the registers for the probabilistic RL. 


Property Comparison type Binary/Continuous variable 
Given names String Continuous 
Family name String Continuous 
Municipality of birth | Intersection Binary 
Date of birth [earliest] | Date Continuous 
Date of birth [latest] | Date Continuous 
Date of death [earliest] | Date Continuous 
Date of death [latest] | Date Continuous 
Municipality of death | Intersection Binary 
Activity end Date Binary 
Military rank Intersection Binary 
Military rank level Numerical Continuous 
Military unit Intersection Binary 
Occupation Intersection Binary 


comparative level of a rank is used. This also addresses the rather permanent 
separation between enlisted ranks and commissioned officers. 


Aggregating Personal Information. After the links of records between reg- 
isters are generated, information is aggregated into the actor ontology, which 
contains the identities and basic metadata of each person, with a data model 
based on CIDOC CRM. New person instances are created in the actor ontol- 
ogy for the records that did not match any existing person and existing person 
instances are enriched with new information. The person records are modeled 
as instances of CIDOC CRM’s document class, which are linked to the person 
instances in the actor ontology. 


4 Results and Evaluation 


The record linkage scenario RL1 results in 620 DRs linked to matching people 
in the 5611 pre-existing person instances, corresponding to 1196 of the people 
in the actor ontology. For the remaining 94 056 DRs, new person instances are 
created. 

'The RL2 scenario results in 1255 person records linked to matching people 
in the 99 667 pre-existing person instances, corresponding to 3096 of the PRs. Of 
the matches, 1234 already exist in the training data as the initial deterministic 
solution was already quite successful in matching the records based on an early 
version of the prisoner register. For 2945 PRs, new person instances are created 
in the actor ontology. 


Comparison Weights. The learned comparison weights depict what infor- 
mation is useful for disambiguating person records in the military history 
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context. The weights of the comparisons vary a little as new runs on updated 
data are done, but their general magnitude is stable. For the newest WarSampo 
data transformation, the comparison weights in the RL2 scenario in descending 
importance order are: family name (2.3), municipality of birth (2.0), given names 
(1.4), date of birth earliest (1.2), birth date latest (1.2), military rank (1.0), occu- 
pation (0.9), military unit (0.8), military rank level (0.8), municipality of death 
(0.4). The remaining comparisons have weight under 0.1. 

Names, municipality of birth and date of birth are intuitively very important 
personal details defining a persons identity. As the date of birth is split into 
two comparisons, it’s overall importance can be summed up to 2.4, making it 
the single most important metadata field. The summed weight of military rank, 
1.8, is higher than that of given names. Military unit is also important, nearly 
as much as a person’s occupation. Occupation of soldiers probably have not 
changed during the war, but what is considered a persons occupation might 
vary depending on the situation and accountant. 


Linking Quality. Due to the mostly rich data of each person contained in the 
person registers, manual evaluation of found links is usually possible, by examin- 
ing the data in detail. This enables estimating the RL precision. Recall evaluation 
however, would need manual inspection of a very high amount of possible pairs, 
of which some have very little information. Also, the DRs are known to contain 
plenty of errors. Hence, it is in many cases difficult to confidently determine the 
true negative results, i.e., the cases where there is no match, which is crucial for 
the recall evaluation. However, manual inspection of matches that almost met 
the matching threshold were either ambiguous or false, suggesting that the recall 
is adequate. 

The precision of the record linkage in both scenarios RL1 and RL2 was 
manually evaluated to be 1.00, based on randomly selecting 150 links from the 
total of 620 links for RL1, and 200 links from the total of 1397 links for the RL2. 
'The information on the person records and the person instances was compared, 
and all of the linked records were interpreted to be depicting the same actual 
persons with high confidence. 


Using the Aggregated Information. The aggregation of information from 
multiple sources provides more full soldier biographies than when using indi- 
vidual sources. For example, the PRs fill a gap that would otherwise exist for 
each of the captured soldiers by providing, e.g., detailed information about their 
movements between prison camps. 

There are also person related documents that are linked to the person 
instances or their military units, i.e., a large collection of wartime photographs, 
hand-written digitized war diaries, and war veteran magazine articles. T'hese eas- 
ily provide further information for people studying for example the war paths of 
their relatives. 

'The Persons perspective of the WarSampo portal uses the aggregated per- 
son instances and information directly from the linked person records to cre- 
ate a unified view of all the information of each person, in a sense creating a 
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“homepage” for them.? In addition to showing the aggregated information, links 
are provided to related documents as well as related military units and people. 


5 Discussion 


'This paper presented the probabilistic record linkage process used in WarSampo 
to integrate heterogeneous person registers into a reconciled KG, which uses 
training data created by a simpler deterministic RL solution. The solution is 
capable of automatically handling updates in the person registers or related 
domain ontologies. The aggregated information can be used for, e.g., biographi- 
cal or prosopographical research by historians, or for study and exploration by 
interested citizens. 

'The weights of different metadata field comparisons, assigned using logistic 
regression, shed light on what metadata fields are useful in disambiguating per- 
son references in the military history context. Military rank and military unit 
are both important person details when determining the identity of a person 
depicted in a person record. 

The data is published openly on SPARQL endpoint and on the WarSampo 
portal, where anyone can evaluate the links between different person records as 
they are modeled as separate resources in the data and information sources are 
shown to users. The Persons perspective of the portal displays all information 
about a single person in the KG. The Casualties and Prisoners perspectives pro- 
vide faceted search and visualizations to explore, study, and analyze the DRs and 
PRs, respectively. In the future, a similar perspective for the aggregated person 
instances would be useful, where a user can conduct similar prosopographical 
analysis over all the persons. 

The solution is scalable and can be further used to integrate more person 
registers into WarSampo. For considerably larger person registers, a blocking 
strategy [2] based on the metadata values should be adopted to reduce the num- 
ber of comparisons. The presented approach is applicable also to other studies 
integrating historical person registers. A simple deterministic RL process can 
be useful for creating training data for a probabilistic RL process in similar 
scenarios where the process needs to be able to handle regular data updates 
automatically. 

In the future, a register of the soldiers who survived the war would be a 
valuable addition to WarSampo, providing the means to study subjects such as 
what affects the soldiers' likelihood of surviving the war. 


Acknowledgements. Our work has been funded by the Association for Cherishing 
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5 Cf. an example person “homepage” at https://www.sotasampo.fi/en/persons/ 
person. 65. 
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